ABSTRACT

The orientation of this report is that of identifying
educational projects which can be considered truly exemplary. The
bulk of the report consists of a 23-step procedure for validating the
effectiveness of educational programs using existing evaluation data.
It is not intended as a guide for conducting evaluations but rather
for interpreting data assembled by others using a wide variety of
experimental and quasi-experimental designs. As such, its coverage is
not restricted to "good" designs. It encompasses all of the commonly
employed evaluation models. The report is concerned with deficiencies
and hazards of various designs with emphasis on the weaker ones
which, as it happens, are also the most feasible in real-world
settings, the least costly, and the most commonly used. The
appendixes contain project selection criteria worksheets, information
regarding norm-referenced versus criterion-referenced tests,
estimation of treatment effect from the performance of an initially
superior comparison group, effects of noncomparable testing dates on
experimental group versus norm group comparisons, and problems using
grade-equivalent scores in evaluating educational gains. Changes from
the original version include the removal of material which dealt with 
project selection criteria unrelated to cognitive achievement 
benefits. (Author/BC)
I. INTRODUCTION

This report was developed in conjunction with Contract No.
OEC-0-73-6662 entitled, "The Development of Project Information Packages
for Effective Approaches in Compensatory Education." As its name
implies, the contract effort was primarily focused on packaging
concepts and procedures which would facilitate the replication of sound
educational practices. There was great concern, however, that the
projects selected for replication should indeed be exemplary in
producing significant cognitive achievement benefits.

Because the selection process was to be based on existing data
derived from a wide variety of experimental and quasi-experimental
evaluation designs, it was clearly necessary not only to establish
criteria for the statistical and educational significance of
achievement gains but also to define procedures for verifying that these
criteria were met. This latter task was not regarded lightly, but it
was, the authors felt, something which could be accomplished in a
straightforward manner by borrowing liberally from the work of
Campbell and Stanley (1963) and others. It did not seem likely that much
original work would be required, or that this report would contain any
significant information not already present in widely read evaluation
texts. These initial impressions, however, were quickly to be rejected.

It was not long after work on the validation procedure began
that it became necessary to put aside the well-documented issues of
experimental design and statistical inference and to probe the
netherworld intricacies of achievement test scores and normative data.
Facts quickly came to light as this exploration proceeded which
appeared to undermine the validity of inferences drawn from nearly all
locally conducted evaluations. The problems were so fundamental that
the authors could not believe they were the first to discover them,
yet they were able to find nothing in the literature which was more
than marginally relevant.

Before they started work on the validation procedure, the authors
considered themselves reasonably sophisticated in both the theory and
practice of educational evaluation. There were, however, a number of
details which had escaped their attention. They were not aware, for
example, that a child scoring in the lowest quartile of the national
distribution could make gains greater than month-for-month over an
entire school year and end up farther below the norm than he began.
They did not know that a fiftieth-percentile third grader could be
2.5 months below grade level in reading, or that an educational
program could appear highly successful if the pre- to posttest interval
spanned the twelve months from 1 May to 1 May but would resemble an
instructional disaster if pupils obtained the same scores on tests
administered one day earlier.

These outrageous incoherencies were just a few of the "horror
stories" uncovered in the course of routinely examining real-world
evaluation studies. The sad part was that these or similar
irrationalities were so pervasive that not a single evaluation report was
found which could be accepted at face value! Even more disheartening, many
of these evaluations followed procedures officially sanctioned by one
or more presumably authoritative groups of experts.

With each new discovery it became increasingly clear that this
report would have new things to say and would have significant
implications beyond the scope of the effort which spawned it. For this reason,
it has undergone several revisions intended to increase its general
usefulness. The most recent change involved removing as much as possible
of the material which dealt with project selection criteria unrelated
to cognitive achievement benefits. Discussion of these criteria (cost,
availability, and replicability) was clearly specific to the contract
effort and appeared to detract from the usefulness of the report for a
broader audience.

While the coverage of the report has changed somewhat from earlier
versions, its format remains the same. The largest section of the
report consists of a 23-step procedure for validating the effectiveness
of educational projects using existing evaluation data. It is not
intended as a guide for conducting evaluations but rather for
interpreting data assembled by others using a wide variety of experimental
and quasi-experimental designs. As such, its coverage is not restricted
to "good" designs. It encompasses all of the commonly employed
evaluation models.

Some inferences may be drawn regarding the relative usefulness of
various designs, but the report is really concerned with deficiencies
and hazards. It follows, then, that emphasis is placed on the weaker
designs which, as it happens, are also the most feasible in real-world
settings, the least costly, and the most commonly used.

One additional point should be mentioned here. The orientation
of this report is that of identifying educational projects which can
be considered clearly exemplary. Unfortunately, in minimizing the
probability of identifying an unsuccessful project as successful, the
decision-tree procedures somewhat increase the probability of rejecting
projects which may really be successful. If the goal were to identify
unsuccessful projects for the purpose of terminating them rather than
successful projects for replication purposes, a different orientation
would be more appropriate.




II. PRELIMINARY SCREENING OF CANDIDATE PROJECTS 



The process of selecting and validating exemplary educational
projects is viewed as iterative in nature with each criterion area
examined at several preliminary levels before analysis is undertaken
at the depth which will ultimately be required. The specific steps
to be taken and the criteria to be used will vary as a function of each
study's particular objectives. The variations, however, should not
represent major departures from the general strategy which was employed
in selecting exemplary compensatory education projects for packaging.
This strategy is described below.

The process began with defining the population from which projects
were to be drawn, assembling a list of candidate projects, and
soliciting available documentation from each of them. When these tasks were
completed, the investigators had in their possession an incomplete
collection of reports, data, and promotional literature on each
candidate project.

Winnowing this information, identifying and obtaining needed
supplementary data, and weighing the resulting evidence was a complex
task. It required a substantial investment of effort including mail
and telephone communication with project personnel and usually at least
one site visit. Typically, it was not feasible to apply the entire
process to all candidate projects, and some preliminary screening
procedures were required. Projects which passed the preliminary screening
criteria were considered "possible" candidates for validation, and all
criterion areas were systematically investigated in greater depth.
When there was doubt as to whether or not a project had met one of the
preliminary criteria, the project was not rejected immediately, but
attention was focused on the specific criterion in question so that
definitely unsuitable projects could be identified and rejected with
a minimum of superfluous effort.



Appendix A contains a set of worksheets which were developed to
facilitate the preliminary screening of compensatory education projects
which were candidates for exemplary status. While the specific
criteria applied to this screening effort may not be widely applicable
without modification, the worksheets should serve as useful models
for any similar types of screening.

The first page was filled in for every candidate project and, 
when completed, provided a record of the disposition of the project. 
The first two sections, "Description" and "Prerequisites," were com- 
pleted as the first step in processing information received from a 
project. Information under these headings served to verify that the 
candidate project did indeed come from the population being considered. 
The third heading, "Final Assessment," was used later to summarize the
results of the investigations in each of the four major criterion areas. 

The second page, "Preliminary Screening Criteria," comprises a
checklist which was used for all projects which met the prerequisites.
A project which clearly failed to meet any one of the criteria was rejected
without evaluating the other criterion areas, and, where doubt existed,
effort was focused on the questionable area to avoid expending possibly
fruitless effort on the others. Projects which survived the initial
screening were subjected to additional investigation in all areas. Page
three was used to summarize information resulting from these additional
investigative steps in the availability, cost, and replicability
criterion areas. Page four was used to describe the tryout design in such a
way as to provide a context for considering the evidence of effectiveness.

The use of forms such as those included in Appendix A for summarizing
and recording preliminary screening information may give the misleading
impression that the screening process is quite rigorous. In fact, it is
no more than a coarse grouping procedure whereby educational projects
are categorized as (a) apparently meeting the selection criteria, (b)
apparently not meeting the selection criteria, or (c) can't tell. Even
the distinction among these groups is not at all clear-cut in the
effectiveness area where misuse of experimental designs and statistical
procedures is quite common and affects results in ways that are not
easily decipherable.

It was decided that the detailed validation procedures would be
applied solely to projects which appeared, on the basis of preliminary
screenings, to meet the selection criteria. Only if the number of
such projects which survived validation was inadequate would it be
necessary to dip into the "can't tell" category. At that point,
validation procedures would be applied to those projects which the
investigators felt were most promising based on whatever circumstantial
evidence they could assemble.

This process would continue, one group at a time, until either
the "quota" was filled or until it became clear that the original
classification had been excessively optimistic and that the probability
of finding additional successes was so remote as to suggest abandoning
the search.
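
The order of work described above can be pictured as a simple loop. The following is only a rough sketch under stated assumptions: the function name `fill_quota`, the argument names, and the `validate` callable (which stands in for the entire validation procedure) are hypothetical and do not appear in the report. The judgment to abandon the search when further successes seem improbable is not modeled.

```python
from typing import Callable, Iterable, List


def fill_quota(
    meets: Iterable[str],
    cant_tell_ranked: Iterable[str],
    validate: Callable[[str], bool],
    quota: int,
) -> List[str]:
    """Validate projects group by group until the quota is filled.

    meets            -- projects that apparently met the screening criteria
    cant_tell_ranked -- "can't tell" projects, most promising first
    validate         -- hypothetical stand-in for the full validation procedure
    """
    accepted: List[str] = []
    # The "apparently meets" group is exhausted before the "can't tell" group.
    for group in (meets, cant_tell_ranked):
        for project in group:
            if len(accepted) >= quota:
                return accepted
            if validate(project):
                accepted.append(project)
    return accepted
```

Projects in the "apparently not meeting" category are never passed to the function, which mirrors the decision to spend validation effort only where success seemed possible.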
III. EVALUATING PROJECT EFFECTIVENESS

Assessing the effectiveness of an educational project presents an 
intrinsically difficult problem. The evaluator faces many pitfalls 
which may be broadly grouped into the three categories of measurement, 
experimental design, and statistics. Hazards exist in each of these 
areas which may completely invalidate any inferences he might draw 
about project impact.

Conventions for experimental design and associated statistics have
been developed to deal effectively with evaluation problems in controlled
experimental settings. Standard reference books describing these
conventions are widely available (e.g., Winer, 1971) and are well known
to most evaluation specialists. Unfortunately, in the real world of
education it is often impossible to employ rigorous techniques, and it
is extremely rare to find a compensatory education project which
satisfies all, or even most, of the fundamental principles of good research
design. The problem is so widespread, in fact, that if one were to
reject all projects with less-than-ideal evaluations, the possibility
of finding even a few exemplary projects would be extremely remote.

Many of the weaker designs have been discussed at length by 
Campbell and Stanley (1963) along with the "threats to internal and 
external validity" associated with each. These authors, however, have
hardly touched upon the related problems of educational measurement. 
Scoring, scaling, and norming considerations become particularly im- 
portant in those designs which employ non-comparable comparison groups 
or no comparison group at all.

The extent and complexity of the experimental and measurement
problems made it clear that a systematic procedure was sorely needed
for reviewing project evaluations, for identifying and assessing the
impact of their shortcomings, and for making reasonable judgments
regarding project effectiveness while carefully weighing all relevant
factors. To meet this need, a 23-step decision tree was developed.
The decision tree was designed to ensure examination of each of the
12 threats to valid inference discussed by Campbell and Stanley (1963)
as they relate to specific evaluation designs. It also encompasses
other important considerations such as the type of scores on which
statistical operations are performed (raw, standard, percentile,
grade-equivalent), whether comparisons are made against control groups
or are norm-referenced, and the bases on which treatment-control
(or norm group) comparisons are made (posttest scores, gain scores,
covariance analysis, etc.).

A procedure of this type cannot, of course, be applied in a vacuum. 
It must be tied to pre-established criteria to which each judgment can 
be related. These criteria include (a) the minimum increment of cog- 
nitive benefit which will be considered educationally significant and 
(b) the minimum non-chance probability level which will be accepted as 
statistically significant.

It should be pointed out that the establishment of criteria for 
educational and even statistical significance is a matter of policy 
decision-making and has only tenuous ties to "science." There are 
associated measurement problems, however, which represent scientific 
challenges of a non-trivial nature. Most educators, for example, will
agree that the goal of compensatory education is to raise the achievement
levels of disadvantaged children from some starting point to an end point
which is closer to the national norm. The question, "How much closer?", 
must be answered by the policy makers. Once this criterion has been 
agreed upon, however, the problem of how to measure the improvement 
must be resolved. 

The use of grade-equivalent scores has appeared to offer a convenient 
solution to the problem. It is intuitively logical that, regardless of 
how far below the national norm a child may be, if he makes gains which 
are greater than month-for-month he will improve his status. It is 
also intuitively logical that if he makes gains which are less than
month-for-month, he will fall farther behind the national norm.
Unfortunately, these fundamentally sound concepts do not stand up in
practice.

Because cognitive growth is not a linear function of time either
between or within years, because test publishers do not collect enough
normative data to construct more meaningful raw-to-grade-equivalent
score conversion tables, and because a lot of interpolation,
extrapolation, and curve-smoothing is always involved, grade-equivalent
scores simply do not behave in a fashion which is consistent with
intuitive or logical expectations. These and other technical problems
associated with grade-equivalent scores and grade-equivalent gains
are discussed in detail later in this report, and examples of some of the
incoherencies which actually occur in real-world situations were presented
in the Introduction. Here it is sufficient simply to say that such
scores do not provide a suitable medium for measuring the achievement
gains that may result from compensatory education projects.

Even if grade-equivalent scores possessed the characteristics 
which they are typically presumed to have, the month-for-month measure 
of effectiveness would be deficient in that it would systematically 
discriminate against projects serving the most severely disadvantaged 
children. This systematic bias stems from the fact that increasing 
an achievement growth rate from 0.9 to 1.0 months-per-month is clearly
easier than raising one from 0.7 to 1.0. A more equitable measure
would be one which is independent of the initial degree of
disadvantagement of the children being served.
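
One way to see the size of this bias is to express each case as the proportional speed-up in growth rate that a project would have to produce. The sketch below is illustrative only; the function name is hypothetical, and treating "difficulty" as a proportional increase is an assumption made here, not a measure proposed in the report.

```python
def required_rate_increase(current_rate: float, target_rate: float = 1.0) -> float:
    """Fractional increase in growth rate needed to reach the target rate.

    Rates are in months of achievement gained per month of schooling.
    """
    return (target_rate - current_rate) / current_rate


# The two cases named in the text.
mild = required_rate_increase(0.9)    # roughly an 11 percent speed-up
severe = required_rate_increase(0.7)  # roughly a 43 percent speed-up
```

Under this reading, a project serving children who have been growing at 0.7 months-per-month must produce nearly four times the proportional improvement demanded of one whose pupils start at 0.9.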

A criterion of this type must be defined in terms of an equal- 
interval scale with some sort of anchor point. Normalized standard 
scores referenced to a national average appear to offer the most appro- 
priate medium in which such a criterion can be cast. Using unstandard- 
ized and/or criterion-referenced tests requires that success be defined 
in some other manner, and there can then be no assurance of equitability 
over the entire range of initial disadvantagement. 

These considerations led the authors to advocate a definition of
educational significance which was expressed in terms of standard score
gains referenced to the national norm. A gain of one-third standard
deviation was subsequently agreed upon as the criterion to be used
for determining exemplary status. Under these conditions, for a
project to be considered for packaging, the mean posttest standard score
of project participants had to be one-third standard deviation higher
with respect to the national norm than the mean pretest score of the
same children.
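
The criterion can be illustrated with a short calculation. This is a minimal sketch which assumes that pretest and posttest scores are already expressed as normalized standard scores on the same national-norm scale; the function names, the example scores, and the use of a scale with a standard deviation of 10 are hypothetical. Treating a gain exactly equal to one-third standard deviation as passing is also an assumption.

```python
from statistics import mean
from typing import Sequence

CRITERION_SD = 1.0 / 3.0  # one-third standard deviation, as adopted in the report


def standard_score_gain(
    pre: Sequence[float], post: Sequence[float], norm_sd: float
) -> float:
    """Mean posttest minus mean pretest, in national-norm standard deviation units."""
    return (mean(post) - mean(pre)) / norm_sd


def meets_criterion(
    pre: Sequence[float],
    post: Sequence[float],
    norm_sd: float,
    criterion: float = CRITERION_SD,
) -> bool:
    """True when the gain reaches the educational-significance criterion."""
    return standard_score_gain(pre, post, norm_sd) >= criterion


# Hypothetical scores for the same three children on a scale with SD = 10.
pre_scores = [38.0, 40.0, 42.0]
post_scores = [42.0, 44.0, 46.0]
gain = standard_score_gain(pre_scores, post_scores, norm_sd=10.0)  # 0.4 SD
```

Because the gain is stated relative to the norm at both testings, the same one-third standard deviation is demanded of every project regardless of how far below the norm its pupils began.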

Criteria for gain are project specific, and in other projects
even the desirability of equitability across all levels of initial
disadvantagement might be offset by other considerations. The 23-step
decision tree was developed so as not to be irrevocably tied either to
standard scores or to gains of one-third standard deviation. It is both
more general and more permissive than the specific criteria which were
adopted for selecting exemplary projects under Contract No. OEC-0-73-6662.
It is, in fact, independent of any specific criterion.

Many if not most of the steps in the decision tree explicitly call
for judgments from the evaluator. At each step it is assumed that the
evaluator is thoroughly familiar with the issues involved and is qualified
to make a judgment based on complex technical considerations. Each
decision-tree step is accompanied by a discussion which is intended to
define the question that is to be answered, but little or no attempt
is made to explain the underlying problems. Such explanations are
included in separate appendices in instances where commonly accepted
principles or practices are discredited and where new or unusual approaches
are endorsed.

It is assumed that the evaluator is familiar with the relevant
statistical tools and will apply them appropriately in making his decisions.
For this reason, standard statistical procedures are discussed briefly,
if at all. More importantly, it should be pointed out that educational
evaluation is, and probably will continue to be, an inexact science.
Even where the most powerful designs are used, it will be possible to
generate plausible hypotheses attributing the observed results to some
influence other than the instructional treatment or to factors unique
to the tryout site in question. Where weaker designs are employed,
it will be highly desirable, or even essential, to strengthen the 
validity of inferences regarding project effectiveness by amassing 
as much supporting evidence as possible. In any case, consistency 
of findings across several replications of an evaluation study would 
constitute the most convincing kind of supporting evidence. 

Figure 1, on page 46, summarizes the 23-step decision tree in flow-
diagram form. Each step is discussed separately on the pages pre- 
ceding Figure 1. (This page arrangement is intended to facilitate 
reference to the fold-out figure.) 

The particular path to be followed through the decision tree 
depends, of course, on the specific design employed in the evaluation 
study under consideration, but each path is structured so as to focus 
attention on the design, analysis, and interpretation pitfalls likely 
to be encountered using that model. Unless a project has been eval- 
uated in several different ways, substantially fewer steps will be 
required than the 23 which comprise the entire decision tree. Pages 
5, 6, and 7 of Appendix A are worksheets for summarizing design char- 
acteristics and evaluation decisions. 

One other point which should be made with respect to the decision 
tree relates to the fact that it has a number of exit points labeled
REJECT. The intent of these exit points is never that the project be 
rejected as unsuccessful. What is rejected is not the project but the 
evaluation data which, if the decision-tree process has been carefully 
followed, have been shown to be inadequate as a basis for reaching any 
conclusion with respect to the success or failure of the project. 

It should be clear from the above and, indeed, from the decision 
tree itself that exacting compliance with the conventions of
experimental design is not generally feasible in real-world educational contexts.
Throughout this report the explicit emphasis given to the subjective
components of the evaluation process constitutes a deliberate attempt
to avoid the misleading impression of algorithmic rigor that might
result if the role of judgment were obscured by rigid procedures,
arbitrary criteria, and dubious tests of statistical significance.




IV. DECISION TREE FOR VALIDATING STATISTICAL SIGNIFICANCE 



Step 1

Question   Are the test instruments adequately reliable and valid for
           the population being considered?

Yes   Proceed to Step 2

No    Reject test scores as measures of project success


Comment 



Appropriate temporal stability reliability estimates should
be used. In general, this means test-retest or alternate
forms estimates rather than measures of internal consistency
such as split-half. Unfortunately, test-retest or alternate
form reliability information is often omitted from test
publishers' manuals. Reliability coefficients are seldom
available for disadvantaged or other special groups. A
rough reliability estimate for a treatment group with a
restricted range of test scores (e.g., bottom quartile) may
be obtained from the following formula (Guilford, 1965,
p. 464):



$$
r_{TT} = 1 - \frac{s_{\mathrm{norm}}^{2}}{s_{T}^{2}}\,\bigl(1 - r_{XX(\mathrm{norm})}\bigr)
$$

where  $r_{TT}$ = reliability for the treatment group

       $r_{XX(\mathrm{norm})}$ = reliability for the norm group

       $s_{T}$ = treatment group pre- or posttest
                 standard deviation (whichever is
                 smaller)

       $s_{\mathrm{norm}}$ = norm group standard deviation



This formula is based on the assumption that the error
variance for the treatment group is equal to the error
variance for the total norm group. If the experimental
group error variance is actually higher than that for
the norm group, this estimate of test reliability will be
too high (see Stanley, 1971, p. 362). Floor effects will
further lower reliability for a group in the tail of a
distribution, and a judgment must be made as to the
magnitude of these effects (see Step 2).
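
The estimate is easily computed. Under the equal-error-variance assumption just stated, the treatment-group reliability equals one minus the ratio of norm-group variance to treatment-group variance, multiplied by one minus the norm-group reliability. The following is a minimal sketch of that relationship; the function and argument names are hypothetical, and the numerical values are invented for illustration.

```python
def restricted_range_reliability(
    r_norm: float, s_treatment: float, s_norm: float
) -> float:
    """Rough reliability estimate for a group with a restricted score range.

    r_norm      -- reliability coefficient reported for the norm group
    s_treatment -- treatment group pre- or posttest SD (whichever is smaller)
    s_norm      -- norm group standard deviation
    Assumes equal error variance in the treatment and norm groups.
    """
    if s_treatment <= 0 or s_norm <= 0:
        raise ValueError("standard deviations must be positive")
    return 1.0 - (s_norm ** 2 / s_treatment ** 2) * (1.0 - r_norm)


# Hypothetical case: published reliability .90, norm SD 10, treatment SD 6.
estimate = restricted_range_reliability(r_norm=0.90, s_treatment=6.0, s_norm=10.0)
```

With these illustrative figures the estimate falls to about .72, which shows how quickly reliability can erode when a project serves only the lower tail of the distribution.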

The primary validity concerns are (a) whether the tests are 
sensitive to any gains students may be making (judgment 
based on comparison of the test content with program con- 
tent is required) and (b) whether the tests are generally 
sensitive to improved reading or arithmetic skills. Widely 
recognized standardized tests may be accepted unless there 
appear to be glaring problems. Special purpose tests must 
be examined closely, and a judgment must be made. Appendix 
B discusses considerations relevant to criterion-referenced 
tests. 

It should be kept in mind that test administration and
scoring procedures may have important effects on
reliability and validity. Unless the procedures outlined in the
publisher's test manual are followed closely, the obtained
scores may seriously misrepresent achievement levels. This
problem is particularly acute where the effectiveness of
an instructional project is assessed by means of norm-group
comparisons.




Step 2

Are pre- or posttest score distributions of any groups
curtailed by ceiling or floor effects?

Yes   Estimate the size of the effect, record
      on the worksheet, and proceed to Step 3

No    Proceed to Step 3

Ideally, the lowest scoring pupil should score above the 
chance level on the test and the highest scoring pupil 
should score below the maximum possible score. The actual 
chance level is difficult to estimate since it depends on 
the guessing strategy of each student. For students who 
guessed randomly on all items they didn't "know," chance 
would equal the number of items divided by the number of
response alternatives per item. However, students often 
leave items blank even when instructed to guess, and when 
they do guess, their choices are not necessarily selected 
randomly from all available alternatives. Because of 
these problems, the most practical way of identifying 
floor or ceiling effects is inspection of score distribu- 
tions for excessive skewness. If the treatment children 
encounter the test floor on pretesting, or the ceiling on 
posttesting, their gains will be underestimated. Gains 
would only be overestimated where the ceiling was encount- 
ered on pretesting and/or the floor on posttesting. This 
improbable event could occur where different levels of a 
test were used for pre- and posttesting but there is 
generally enough overlap between levels so that this type 
of situation does not arise. 

If the experimental design employs a control group, it
would be subject to similar estimation errors which would
then need to be considered in combination with those of
the treatment group.
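
The two checks mentioned in this step, the random-guessing chance level and inspection of the score distribution for skewness, can be sketched as follows. The function names and the example scores are hypothetical, and the skewness coefficient used here (the mean cubed standardized deviation) is one common choice rather than a statistic prescribed by the report.

```python
from statistics import mean, pstdev
from typing import Sequence


def chance_score(n_items: int, n_alternatives: int) -> float:
    """Expected score for a pupil who guesses randomly on every item."""
    return n_items / n_alternatives


def skewness(scores: Sequence[float]) -> float:
    """Mean cubed standardized deviation of the scores.

    A pile-up of scores near the test floor tends to give a positive
    value; a pile-up near the ceiling tends to give a negative value.
    """
    m = mean(scores)
    s = pstdev(scores)
    if s == 0:
        return 0.0
    return mean([((x - m) / s) ** 3 for x in scores])


# Hypothetical pretest on a 40-item test with 4 alternatives per item.
floor_estimate = chance_score(40, 4)
pretest = [9, 10, 10, 11, 11, 12, 14, 18, 25]
pretest_skew = skewness(pretest)
```

In this invented example most scores lie close to the chance level and the coefficient is positive, which is the pattern that would prompt a judgment about the size of a floor effect.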
Step 3



Is there reason to believe that the pretesting experience 
may have been at least partially responsible for the ob- 
served treatment effect? 

Yes Estimate the size of the effect, record
on the worksheet, and proceed to Step 4 

No Proceed to Step 4 

If standardized tests are used, and the experimental 
design employs a control group, the pretesting experience 
should have little or no effect on the outcome of the 
evaluation. Pretesting with criterion-referenced tests 
may sensitize pupils as to what they are expected to 
learn. This sensitization may interact differentially 
with the learning experiences available to treatment 
and control pupils so as to produce greater learning of 
criterion items in the treatment group. 

A more serious problem arises where there is no control
group because, as Campbell and Stanley (1963) point out,
"students taking the test for the second time, or taking
an alternate form of the test, etc., usually do better
than those taking the test for the first time [p. 179]."
Since, presumably, children in the norm groups took the
test only once, this spurious increment would be present
only in posttest scores of the program participants and
could thus lead to erroneous conclusions regarding
project impact. A compounding of this effect would almost
certainly occur if pretesting was the children's first
test-taking experience. Under these conditions, pretest
scores might be artificially low.

Assuming some test-taking sophistication, a rule-of-thumb
estimate for the size of the practice effect would be one
tenth of a standard deviation if the same form of the test
were used for both pre- and posttesting (Levine & Angoff,
1958). Use of alternate forms would significantly reduce
this effect.
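
When a norm-referenced design has no control group, the rule of thumb can be applied by subtracting the estimated practice effect from the observed gain before the gain is compared with any criterion. The sketch below assumes that gains are expressed in standard deviation units; the function name and the example gain are hypothetical, and the report gives no figure for alternate forms, so any smaller value supplied for that case is the evaluator's own judgment.

```python
PRACTICE_EFFECT_SD = 0.10  # rule of thumb when the same form is used twice


def gain_net_of_practice(
    observed_gain_sd: float, practice_effect_sd: float = PRACTICE_EFFECT_SD
) -> float:
    """Observed standard-score gain less the estimated practice effect."""
    return observed_gain_sd - practice_effect_sd


# Hypothetical project reporting a gain of 0.40 SD on the same test form.
net_gain = gain_net_of_practice(0.40)
```

In this invented case an apparent gain of 0.40 standard deviation shrinks to about 0.30, which would no longer reach a one-third standard deviation criterion.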
Step 4

Is there reason to believe that knowledge of group
membership may have been at least partially responsible for
the observed treatment effect?

Yes Estimate the size of the effect, record 
on the worksheet, and proceed to Step 5 

No Proceed to Step 5 

Knowledge of group membership may produce the Hawthorne
effect in members of the treatment group or the "John
Henry" effect (Saretsky, 1972) in the control group.
[The Hawthorne effect is the occurrence of a performance
increment which results, not from the efficacy of a
particular treatment, but simply from an awareness that
something special is being done. See Whitehead (1938) and
Parsons (1974) for further explication. The John Henry
effect arises when those who do not receive special
treatment make an extra effort in an attempt to demonstrate
that they can do just as well without it.] There are
other spurious influences of this type which may also
confuse the issues. Children may deliberately score poorly
on a test in order to get into a special program or to
keep from graduating out of a program they enjoy. They
may also score poorly to punish a teacher or developer
they dislike.

In theory, many of these effects could be experimentally
controlled through use of a placebo treatment as is
commonly done in medical research. In practice, however,
this approach is not feasible and the educational
researcher is left in the unenviable position of having
no experimental or statistical technique for controlling
such influences. Although they have a tendency to
dissipate with time, the researcher has no real recourse but
to rely on his own experience and judgment in deciding
whether treatment outcomes should be attributed to
treatment effects or simply to knowledge of group
membership.
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step 5 



Is there reason to believe that student turnover may have 
been partially responsible for the observed treatment 
effect? ^ 



Most often, educational evaluations restrict their reporting
to include only pupils for whom both pre- and posttest
scores are available. Pupils for whom complete data are
not available are likely to be systematically different
from others (lower socioeconomic status, more mobile
families, higher absenteeism rate, higher dropout rate,
etc.). For this reason, care must be exercised not to
generalize the findings to the total group which was
pretested.

Where pretest and posttest scores are reported on groups
which are not identical (i.e., some children have pretest
scores only and others have just posttest scores),
systematic biases may be present. Students who dropped out, for
example, may have been the lower scorers and thus have
contributed to a spuriously low mean pretest score and
spuriously high apparent gain. Pupils entering a project
after it begins may also be atypical and may cause posttest
scores to be either too high or low. If differential
turnover is observed between the treatment and the control
groups, explanations should be sought out and their impact
on the evaluation findings should be carefully assessed.
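The turnover bias described above can be illustrated with a minimal Python sketch. The scores, the function names, and the use of None to mark a missing score are all hypothetical and serve only to show how a gain computed from non-identical pretest and posttest groups can differ from the gain of the pupils who took both tests.

```python
from statistics import mean

def mean_gain_matched(records):
    """Mean gain using only pupils who have both a pretest and a posttest score."""
    gains = [post - pre for pre, post in records
             if pre is not None and post is not None]
    return mean(gains)

def mean_gain_unmatched(records):
    """Posttest mean minus pretest mean, using every available score,
    so the two means may be based on different pupils."""
    pre_scores = [pre for pre, _ in records if pre is not None]
    post_scores = [post for _, post in records if post is not None]
    return mean(post_scores) - mean(pre_scores)

# Hypothetical (pretest, posttest) standard scores; None marks a missing score.
records = [
    (40, 44), (45, 49), (50, 54), (55, 59),
    (30, None), (32, None),  # low scorers who left before the posttest
]

matched = mean_gain_matched(records)      # gain of pupils tested twice
unmatched = mean_gain_unmatched(records)  # inflated by the departed low scorers
print(matched, unmatched)
```

In this invented case the pupils tested twice gained 4 points, while the comparison of unmatched group means suggests a gain of 9.5 points, the excess being due entirely to turnover.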



Yes   Estimate the size of the effect, record
      on the worksheet, and proceed to Step 6

No    Proceed to Step 6

Step 6



Does the evaluation employ a control group? 

Yes Skip to Step 14 
No Proceed to Step 7 

Control and norm groups serve identical purposes in
evaluation designs, namely to provide an estimate of
how the treatment group would have done if it had not
received the treatment. The difference between the
no-treatment expectation and the observed performance
following treatment exposure constitutes the treatment
effect. The term "control group" is used loosely here
to connote any comparison group other than a norm group.
While the distinction between comparison and norm groups
is not entirely clear cut, it is assumed that the data
available on norm groups are cross-sectional in nature
and do not include scores on individuals while data
from typical control-group designs are longitudinal
records of individual students. The latter are amenable
to covariance analysis, while the former are not.

If some kind of control group is not employed in the
evaluation design, gains made by the treatment group must be
evaluated through norm-referenced comparisons.
Comparisons of this type are usually reported in terms of either
grade-equivalent gains or some measure of movement with
respect to the national norm such as mean percentile shift.
Such norm-referenced comparisons are discussed in the
branch of the decision tree which begins with Step 7.
Control group designs are discussed in the branch beginning
with Step 14.

Step 7



Were pretest scores used to select the treatment group? 

Yes Estimate the size of the regression 
effect, record on the worksheet, and 
proceed to Step 8 

No Proceed to Step 8 

It is often the case that children with the greatest
educational need are selected for program participation
from a larger group of children. If this selection is
based on achievement test scores which are subsequently
treated as pretest measures, a spurious negative
correlation is produced between pretest performance and gains
from pre- to posttest. This spurious relationship arises
from the fact that scores at the low end of a distribution
reflect a preponderance of negative measurement error while
those at the high end reflect a preponderance of positive
measurement error. Immediate retesting of the extreme
groups (using an alternate form of the test) would show
the so-called regression effect whereby the mean scores
of these groups would move closer to the original
total-group mean than they were on the original test.

The magnitude of the regression effect can be approxi- 
mated by estimating the mean pretest "true" score from 
the test reliability. To obtain this estimated mean true 
score, the difference between the observed mean and the 
population mean must first be expressed in standard devia- 
tion units. The difference is then multiplied by the test- 
retest or alternate-form (not split-half) reliability co- 
efficient presented in the test manual. The product may 
then be "translated" back into the units of the observed 
mean score to yield the estimated mean true score. 
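The estimation procedure just described can be sketched in a few lines of Python. The function names and the numeric values (a population mean of 50, a standard deviation of 10, a selected-group mean of 35, and a reliability of .80) are hypothetical and are used only to show the arithmetic.

```python
def estimated_true_mean(observed_mean, population_mean, population_sd, reliability):
    """Estimate the mean pretest 'true' score of a selected group.

    reliability is the test-retest or alternate-form coefficient
    (not split-half) taken from the test manual.
    """
    # Express the group's distance from the population mean in SD units.
    z_observed = (observed_mean - population_mean) / population_sd
    # Shrink that distance by the reliability coefficient.
    z_true = reliability * z_observed
    # Translate back into the units of the observed mean score.
    return population_mean + z_true * population_sd

def regression_effect(observed_mean, population_mean, population_sd, reliability):
    """Shift toward the population mean expected on retest without any treatment."""
    return estimated_true_mean(
        observed_mean, population_mean, population_sd, reliability
    ) - observed_mean

# Hypothetical group selected for low pretest scores.
true_mean = estimated_true_mean(35.0, 50.0, 10.0, 0.80)
effect = regression_effect(35.0, 50.0, 10.0, 0.80)
print(round(true_mean, 2), round(effect, 2))
```

With these invented figures the estimated mean true score is about 38, so roughly 3 points of any apparent pre- to posttest gain would be attributable to regression rather than to the treatment.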
Step 8



Are normative data available for testing dates which can
be meaningfully related to the pre- and posttesting of
the program pupils?

Yes Proceed to Step 9 

No Reject norm-group comparisons as adequate 
evidence of project success 

Some test publishers have collected normative data at
more than one point during the school year while others
have relied on a single data point per year. In either
case, it is common practice to publish separate norms
tables for the beginning, middle, and end of each school
year. Obviously, some of these norms are constructed
through processes of interpolation and/or extrapolation.
These constructed norms, while possibly useful for
counseling or diagnostic purposes, are likely to be in error
by amounts large enough to invalidate any inferences
drawn about cognitive growth. They should never be used
for assessing the impact of educational influences.

Where real (as opposed to constructed) norms are used,
they should be thought of as representing data from a
control group. While even the most naive evaluators would
recognize the folly of testing the treatment and control
groups at significantly different times, test publishers'
suggestions that their norms are valid over three- or even
four-month periods are rarely questioned. Clearly, however,
the treatment group is being compared to a norm group
tested at specific times, and unless the testing times of the
two groups correspond very closely, any comparisons are
likely to be quite misleading. Ideally, the treatment
group should be tested at times exactly corresponding
to real normative data points. If this is not possible,
linear interpolations or extrapolations of a month or
even two months from the specific testing dates on which
the norms are based should not introduce large error
components. Certainly, it is better to interpolate or
extrapolate than simply to use the given norms when the
testing times differ. (See also Appendix D.)
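A linear interpolation or extrapolation of this kind might be sketched as follows. The function name, the two-month limit expressed as a parameter, and the norm values are hypothetical. Testing times are given in tenths of a grade (for example, 71 for grade level 7.1) so that the distance check uses whole numbers.

```python
def interpolate_norm(time_tested, time_a, score_a, time_b, score_b, max_gap=2):
    """Linearly interpolate (or extrapolate) a norm-table score.

    time_a and time_b are the dates of two real normative data points,
    in tenths of a grade; score_a and score_b are the norm scores at a
    fixed percentile on those dates. A ValueError is raised when the
    testing date is more than max_gap months from the nearest real data
    point.
    """
    nearest_gap = min(abs(time_tested - time_a), abs(time_tested - time_b))
    if nearest_gap > max_gap:
        raise ValueError("testing date too far from real normative data points")
    fraction = (time_tested - time_a) / (time_b - time_a)
    return score_a + fraction * (score_b - score_a)

# Hypothetical norms: the 50th-percentile score is 40 at grade 7.1
# and 47 at grade 7.8; the project pupils were tested at grade 7.3.
expected_score = interpolate_norm(73, 71, 40.0, 78, 47.0)
print(round(expected_score, 2))
```

With these invented values the interpolated norm is about 42, whereas a testing date of grade 7.5 would be rejected because it lies three months from the nearest real data point.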

Another possibility, where testing times were non-comparable, 
would be to make explicit the comparisons which were made. 
An example of this approach might be as follows: "The 
mean score on the pretest (administered at grade level 7.1) 
fell at the 24th percentile of the grade 7.6 norm group 
while the mean score on the posttest (administered at grade 
level 7.8) was at the 36th percentile of the 8.6 norm group." 
While this approach may be somewhat confusing, it is scienti- 
fically sound whereas other commonly employed approaches 
(e.g., use of constructed norms) are simply not meaningful.

Step 9

Do the norms provide a valid baseline against which to 
assess the progress of the treatment group? 

Yes Proceed to Step 10 

No Reject norm-group comparisons as adequate
evidence of project success

Ideally, the norm group should be a representative sample
of the population from which the treatment group is drawn.
Thus, disadvantaged children should be compared against a
disadvantaged norm. While some work toward the
development of such norms has been accomplished, only nationally
representative norms are available for most standardized
achievement tests.

When groups of disadvantaged children are compared against
"national" norms they are compared against a composite of
subgroups, some of which may be like them while others are
certainly not (e.g., non-disadvantaged "late bloomers").
For comparisons to be valid, these subgroups must maintain
the same relative positions with respect to one another
over time, as significant among-group changes would
indicate differential group growth rates with respect to
the overall norm. At the present time, there is no evidence
that different group growth rates occur (despite the
implication of "late blooming"). Thus, while there are
potential hazards in using nationally representative norms
to assess the progress of atypical groups, it does not
appear unreasonable to do so.

Where treatment groups are clearly special (e.g.,
non-English speaking), national norms should not be assumed
to constitute a meaningful basis for progress assessment.

One further comment should be made with respect to
normative data for grades above the elementary level.
Since dropouts come largely from the low end of the
distribution, the norm will tend to move up at each grade
level with respect to the non-dropouts, thus producing
an apparent negative effect on their cognitive growth
rates.

Step 10



Is the comparison between the treatment group and the
norm group based on pre- and posttest scores or on gain
scores?

Pre- and Posttest Scores   Proceed to Step 11

Gain Scores                Skip to Step 12

To determine whether observed pre- to posttest gains
exceed no-treatment expectations, it must be possible
to derive the expected values from the appropriate norms
table. Generally this derivation requires knowledge of
the mean treatment group pretest scores since the
no-treatment expectation is that the group will maintain its
percentile standing with respect to the national norm
from pre- to posttest. Simply knowing that the treatment
group made a mean gain of 29 raw score points would
not suffice to determine the no-treatment expectation.

Grade-equivalent gain scores appear to be an exception to
this general rule. It seems that simply expressing gains
in terms of grade-equivalent months per month of project
exposure automatically provides a comparison with "the
average child". Not only is this appearance erroneous,
but scaling and other problems associated with
grade-equivalent gains are so severe that these scores are more
misleading than useful (see Appendices D and E).

Gain scores derived from "regular" standard scores (as
opposed to expanded standard scores) constitute the only
real exception to the need for pretest scores in
norm-referenced evaluations. Where such scores are provided
(e.g., for the Gates-MacGinitie) the no-treatment expected
gain is 0.0 points. Unfortunately, very few publishers
include "regular" standard scores in their test manuals.

Step 11



Have appropriate statistical tests been employed to assess
the significance of the gain in treatment group performance
relative to the norm group?

Yes   Skip to Step 23

No    Skip to Step 13

The gain of the treatment group with respect to the norm
is determined by subtracting the expected mean posttest
score from the observed mean posttest score. To find the
expected mean posttest score:

1. Determine the percentile equivalent of the mean
   pretest raw or, preferably, standard, expanded
   standard, or scale score.

2. Enter the norm table appropriate for the posttest
   with the pretest percentile and read out the
   corresponding raw, standard, expanded standard,
   or scale score (the type of score must correspond
   to that of the observed mean posttest score).

The statistical significance of the treatment effect can
be assessed using the following formula:

$$ t_{N-1} = \frac{\bar{Y} - \hat{Y}}{\sqrt{\dfrac{s_x^2 + s_y^2 - 2 r_{xy} s_x s_y}{N - 1}}} $$

where  $\bar{Y}$  = observed mean posttest score
       $\hat{Y}$  = expected mean posttest score
       $s_x$      = pretest standard deviation
       $s_y$      = posttest standard deviation
       $r_{xy}$   = correlation between pre- and posttest scores
       $N$        = number of children
       $N - 1$    = degrees of freedom

Using this formula assumes, of course, that normative data
are available for testing dates comparable to the pre- and
posttest administration times (see Step 8).

Some test manuals provide simplified procedures for
determining the significance of a gain from pre- to posttest.
These procedures should not be used, however, as they
incorporate assumptions about the correlation between pre-
and posttest scores which may not be applicable to the
project participants. The significance of the gain should
be determined from data in hand.
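The computation might be sketched in Python as follows. The sketch assumes that the statistic takes the usual correlated-means form, in which the difference between the observed and expected mean posttest scores is divided by the square root of (s_x^2 + s_y^2 - 2 r_xy s_x s_y) / (N - 1). The function name and all numeric values are hypothetical.

```python
import math

def norm_referenced_t(observed_post_mean, expected_post_mean,
                      sd_pre, sd_post, r_pre_post, n):
    """Return (t, degrees of freedom) for a norm-referenced gain.

    expected_post_mean is the norm-table score at the posttest date that
    corresponds to the group's pretest percentile (the no-treatment
    expectation). The correlated-means standard error used here is an
    assumption of this sketch.
    """
    variance_of_difference = (
        sd_pre ** 2 + sd_post ** 2 - 2.0 * r_pre_post * sd_pre * sd_post
    )
    standard_error = math.sqrt(variance_of_difference / (n - 1))
    t = (observed_post_mean - expected_post_mean) / standard_error
    return t, n - 1

# Hypothetical project data: 26 pupils, standard scores.
t_value, df = norm_referenced_t(
    observed_post_mean=52.0, expected_post_mean=48.0,
    sd_pre=10.0, sd_post=10.0, r_pre_post=0.5, n=26,
)
print(t_value, df)
```

The resulting t would then be referred to a table of Student's t with the stated degrees of freedom at the significance level chosen in advance.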



Step 12



Are pre- and/or posttest scores available? 
Yes Proceed to Step 13 

No Reject norm-group comparisons as adequate 
evidence of project success 

Except in those unusual instances where gain scores are
derived from "regular" standard scores (scores which have
been normalized and standardized independently at each
normative data point), it is not possible to derive gain
expectations from them. Where gain scores derived from
"regular" standard scores are available, the mean gain
score can replace the numerator of the formula given in
Step 11 and the standard error of the gain (the standard
deviation of the gain scores divided by the square root of
the number of pupils) can replace the denominator of the
same equation.

All other gain scores are uninterpretable with respect 
to expectations. Unless, therefore, it is possible to 
retrieve pre- and posttest scores, norm-group comparisons 
must be rejected as infeasible.

Step 13



Question Can appropriate statistical tests be employed to assess
the significance of the gain in treatment group
performance relative to the norm group?



Yes Compute appropriate statistics and 
skip to Step 23 

No Reject norm-group comparisons as adequate 
evidence of project success 



Comment If the mean pretest and posttest scores and the associated
standard deviations are available, the statistical
significance of the treatment effect can be assessed using the
formula given in Step 11. If these values are not
available and cannot be computed from raw data, norm-group
comparisons must be rejected as infeasible.

Step 14



Were the children, either matched or unmatched, randomly 
assigned to the treatment and comparison groups? 

Yes Skip to Step 18 
No Proceed to Step 15 

A "yes" answer to this question implies that, prior to
the beginning of the experiment, a pool of eligible children
existed and each child had an equal chance of being assigned
to the treatment group. It further implies that assignment
was made on a purely chance basis without any knowledge or
consideration of the characteristics of the pupils (except,
of course, where matching was done prior to assignment).

If a matching procedure is employed, it should be imple- 
mented as follows. The entire pool of eligible children 
should be organized into carefully matched pairs on the 
basis of pretest scores and other potentially relevant 
variables (e.g., sex). One member of each pair should then 
be selected at random for assignment to the treatment group. 
The remaining member of the pair would, of course, be as- 
signed to the comparison group. 

Note: Matching after assignment to treatment and com- 
parison groups is a fundamentally unsound practice. (See 
Step 15.)

Step 15



Is there evidence that members of the treatment and
control groups belong to the same population or to
populations that are similar on all educationally relevant
variables including pretest scores?

Yes Proceed to Step 16 
No Skip to Step 19 

As Lord (1967) has pointed out, "If the individuals are
not assigned to the treatments at random, then it is not
too helpful to demonstrate statistically that the groups
after treatment show more difference than would have been
expected from random assignment -- unless, of course, the
experimenter has special information showing that the
nonrandom assignment was nevertheless random in effect"
[p. 38]. Where pre-existing, intact groups are used as
the treatment and control groups, it is not appropriate
to assume that they are, even in effect, random samples
from a single population. The probability that they may
be must be investigated empirically. At the very least,
the two groups must not be significantly different in
terms of pretest scores. They should also be comparable
in terms of socioeconomic status, age, sex, and racial
and ethnic composition. School size and setting
(urban-rural) as well as neighborhood should also be comparable.
Even with these factors equated, serious selection biases
are common. Such biases are introduced when teacher or
student participation is voluntary or when experimental
groups are selected by principals or teachers.

A common design error where comparable, intact groups
cannot be found is that of matching members of the
treatment group with specific members of other, non-comparable
groups. The assumption here is that a comparable control
group can be constructed through the matching process.
The fallacy inherent in this assumption is that the
selected subgroup is atypical of the group from which it is
drawn and will show a regression toward the mean of that
group on posttest measures. Campbell and Stanley (1963)
describe this type of post-hoc matching as "a stubborn,
misleading tradition in educational experimentation,"
and as a "hazard" which is "frequently tripped over" [p. 219].

Step 16



Question Are post-treatment comparisons made in terms of posttest
or gain scores?

Posttest Scores   Skip to Step 20

Gain Scores       Proceed to Step 17



Comment Two types of gain scores are frequently used in educational
evaluation: "raw" and residual gain scores. Comparisons
between treatment and control groups based on raw gain
scores (posttest scores minus pretest scores) are
identical to comparisons based on posttest scores where the
between-group posttest difference has been adjusted by the
full amount of the pretest difference. Except in the case
where the correlation between pretest and posttest scores
is perfect, this adjustment is excessive. A
pretest-posttest correlation of less than one implies that the
pretest scores reflect some variance not included in the
posttest scores. This variance, which is typically called
measurement error, may reflect a large number of extraneous
influences, some of which are random, while others represent
systematic differences between the groups. In either case,
variance due to measurement error is not relevant to
posttest scores and represents a portion of the pretest scores
which should not be subtracted from them. Since high
pretest scores have a preponderance of positive measurement
error while low pretest scores have a preponderance of
negative measurement error, the use of raw gain scores will
produce a spurious negative correlation between pretest
status and gains. In other words, the higher the pretest
score, the lower the gain. Thus, where the experimental
group had lower pretest scores, the use of gain scores
will increase the probability that a non-significant
treatment effect will appear significant. If the
experimental group is initially superior, valid inferences may
be drawn about the treatment effect if the raw gain scores
of the two groups are found to be significantly different.
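The equivalence between a raw gain comparison and a posttest comparison adjusted by the full pretest difference can be shown with a small Python sketch. The function name, the group means, and the regression slope of 0.7 are hypothetical.

```python
def adjusted_posttest_difference(pre_t, post_t, pre_c, post_c, slope):
    """Treatment-minus-control posttest difference after subtracting
    slope times the pretest difference.

    A slope of 1.0 reproduces a raw gain-score comparison; a slope
    below 1.0 corresponds to a less-than-perfect pretest-posttest
    relationship.
    """
    return (post_t - post_c) - slope * (pre_t - pre_c)

# Hypothetical group means: the treatment group starts 10 points lower.
pre_t, post_t = 40.0, 46.0   # treatment group gains 6 points
pre_c, post_c = 50.0, 54.0   # control group gains 4 points

raw_gain_difference = (post_t - pre_t) - (post_c - pre_c)
full_adjustment = adjusted_posttest_difference(pre_t, post_t, pre_c, post_c, 1.0)
partial_adjustment = adjusted_posttest_difference(pre_t, post_t, pre_c, post_c, 0.7)
print(raw_gain_difference, full_adjustment, round(partial_adjustment, 2))
```

In this invented case the raw gain comparison favors the initially inferior treatment group by 2 points, while the smaller adjustment implied by a slope of 0.7 leaves it about 1 point behind, which is the excessive correction the comment warns against.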

Residual gain scores are not gain scores at all but are
differences between observed posttest scores and posttest
scores predicted from the regression of posttest on
pretest scores for the experimental and control groups
combined. Where the regression line for the combined group
is a weighted average of the within-group regression
lines, residual gain scores are equivalent to posttest
scores adjusted for pretest differences through covariance
analysis. This equivalence, however, does not hold except
where the two groups have comparable pretest score
distributions. In fact, where pretest scores are substantially
different and posttest scores are equal, the slope of the
combined-group regression line approaches zero and the
residual gain technique obscures the effect of pretest
differences completely. Since residual gain scores
systematically under-correct for pretest differences, their
use is always undesirable. Where analysis of residual gain
scores indicates that an initially inferior treatment group
has outperformed the comparison groups, the success of the
treatment can be accepted. Where results under these
circumstances are non-significant, or where the treatment
group scored higher than the controls on the pretest, the
results of the analysis should be regarded as inconclusive
at best. There is a very real danger that a successful
treatment will be rejected using this procedure and some
other form of analysis should be undertaken if at all
possible.
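The under-correction can be seen in a deliberately extreme Python sketch. All scores and function names are hypothetical: the two groups differ by 10 points on the pretest, have identical posttest scores, and, by construction, show a perfectly linear within-group relationship between pretest and posttest.

```python
from statistics import mean

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def adjusted_difference(pre_t, post_t, pre_c, post_c, slope):
    """Treatment-minus-control mean posttest difference after removing
    slope times the mean pretest difference."""
    return (mean(post_t) - mean(post_c)) - slope * (mean(pre_t) - mean(pre_c))

# Hypothetical scores for an initially inferior treatment group.
pre_t, post_t = [38, 40, 42], [48, 50, 52]
pre_c, post_c = [48, 50, 52], [48, 50, 52]

# Average of the two within-group slopes (each equals 1 here).
within_slope = (ols_slope(pre_t, post_t) + ols_slope(pre_c, post_c)) / 2
# Slope of the regression line for the two groups combined.
pooled_slope = ols_slope(pre_t + pre_c, post_t + post_c)

# The difference in mean residual gain uses the combined-group slope.
residual_gain_difference = adjusted_difference(pre_t, post_t, pre_c, post_c, pooled_slope)
# A covariance-type adjustment uses the within-group slope.
covariance_difference = adjusted_difference(pre_t, post_t, pre_c, post_c, within_slope)
print(round(pooled_slope, 3), round(residual_gain_difference, 3), covariance_difference)
```

Here the combined-group slope falls to about 0.10, so the residual gain analysis credits the treatment group with less than 1 point, whereas an adjustment based on the within-group slope credits it with 10 points.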
Step 17

Question Can data be obtained which would enable application of
covariance analysis techniques, would such analyses be
appropriate, and is there a reasonable expectation that
they would produce significant results?

Yes Conduct covariance analysis and proceed
to Step 23

No Skip to Step 21

Comment Wherever pretest differences between treatment and
control groups have resulted from random assignment
procedures, covariance analysis should be employed, if
possible, to adjust for these differences. Where the
treatment group was superior on the pretest, this type
of analysis will significantly reduce the probability of
incorrectly inferring a treatment was successful when it
was not. Conversely, where the treatment group was
initially inferior, covariance analysis will significantly
reduce the probability of rejecting a successful treatment
as unsuccessful. In both instances the covariance
adjustment will increase the accuracy of posttest measures so
that the true magnitude of program impact can be
determined.

There is, of course, no justification for the extra
computational labor required for covariance analysis if the
two groups obtained equal scores on the pretest.

Step 18



Were pretest scores collected? 



Yes   Go back to Step 15

No    Proceed to Step 21



If assignment of pupils to treatment and control groups
has been truly random, it is not essential to collect
pretest scores since valid inferences can be drawn from
posttest score comparisons. If pretest scores are collected,
however, more powerful statistical tests can be employed.

Step 19



Question Is the control group superior to the treatment group
on the balance of educationally relevant variables?

Yes See Appendix C

No Reject control group comparison as adequate
evidence of project success

Comment "Educationally relevant variables" include, but are
probably not limited to, pretest scores, socioeconomic status,
age, sex, racial and ethnic composition, and school and
community factors. Where there are significant differences
between treatment and control groups on one or more of
these variables, "true" experimental designs cannot be
employed. The alternative quasi-experimental approaches
which may be adopted all rest on sets of assumptions of
varying degrees of plausibility. If it could be assumed,
when dealing with non-comparable groups, that they would
respond in a similar manner to the presence or absence of
the variable under investigation, there would be no real
problem. Because this assumption is untenable, however,
it is generally necessary to make other assumptions about
how their responses would differ. One such assumption
which is relevant here and appears "safe" is that a group
which is initially superior to another group in cognitive
development will continue to grow at a rate equal to or
greater than that of the initially inferior group, other
things being equal. If, then, the initially inferior
group outperforms the initially superior group after
exposure to a special educational treatment, it is probably
safe to conclude that the treatment was effective. On the
other hand, if the treatment had been administered to the
initially superior group, it would not be possible to reach
any conclusion by comparing its growth rate against that
of the initially inferior group.

Other assumptions could be made which would permit the
quantification of growth rates and thus enable
comparisons to be made in both directions and with the
appearance, at least, of greater precision. Assumptions of this
type, unfortunately, tend to require massive doses of faith
since there is little in the way of empirical data to
support them. Until such data are assembled, the authors
reject as inadequate any evaluations where the treatment
group is initially superior to the comparison group.

Step 20



Question Have covariance analysis techniques been employed to adjust
for initial differences between groups?

Yes Skip to Step 23 
No Go back to Step 17 

Comment Where assignment to either the treatment or the control
group has been random or "random in effect" (see Step 15),
analysis of covariance is the most powerful statistical
technique available for testing treatment effects. If the
analysis has been done correctly, its findings may be
accepted at face value.

Covariance analysis must not be regarded as a substitute
for truly comparable groups. It can only be used where
its assumptions (effectively-random assignment and
homogeneity of regression) are met and where initial differences
between groups are not excessive. It should be noted that
even where regression is statistically non-heterogeneous,
small differences in regression line slopes introduce errors
into the computations. These errors interact in a
multiplicative fashion with the size of the between-group
difference. A small error multiplied by a big difference
becomes a big error. For this reason, it is common to use
the 10% level for rejecting the hypothesis of homogeneous
variance. Use of the 20% level would be appropriate when
the difference between group means is large.

Step 21



Question Have appropriate statistical tests been employed to 

compare posttest or gain scores? 

Yes Skip to Step 23 
No Proceed to Step 22 

Comment A wide variety of statistical tests and procedures can be
used for testing differences between groups. Raw or
(preferably) standard score comparisons may often be made
on either posttest or gain scores using parametric
statistical tests such as Student's t for independent means
(t for correlated scores where pupils were matched prior
to assignment to groups) or analysis of variance. However,
the data should be inspected to confirm that the
assumptions of these tests have been met, since score
distributions from special instructional projects are likely to be
badly skewed.

Where parametric test assumptions are not met, non-parametric
tests such as the Mann-Whitney U or the Kolmogorov-Smirnov
test are appropriate but are less powerful than their
parametric equivalents. Non-parametric tests must also be used
where comparisons are made between posttest grade-equivalent
scores (assuming random assignment). There is no meaningful
way in which grade-equivalent gains can be compared.
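As an illustration of one such non-parametric test, the Mann-Whitney U statistic can be computed directly from two sets of posttest scores. The sketch below is a minimal pure-Python version; the function name and the scores are hypothetical, and tied pairs are counted as one half, which is a common convention rather than anything prescribed in this report.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: the number of (a, b) pairs in which
    a exceeds b, with tied pairs counted as one half."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Hypothetical posttest scores for randomly assigned groups.
treatment = [12, 15, 18, 20]
control = [10, 11, 15, 16]

u_treatment = mann_whitney_u(treatment, control)
u_control = mann_whitney_u(control, treatment)
print(u_treatment, u_control)
```

The two U values always sum to the product of the group sizes (16 here). The smaller of the two would be referred to a table of critical values for the Mann-Whitney test at the significance level chosen in advance.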

The cautions regarding the drawing of valid inferences from
gain-score comparisons discussed in Step 16 should be
carefully observed.

Step 22 



Question Can data be obtained which would enable appropriate tests
to be made?

Yes Obtain data, compute appropriate
statistics, and proceed to Step 23

No Reject posttest and/or gain score
comparisons as adequate evidence of
project success



Comment Where inappropriate statistical approaches have been
adopted, there is no choice but to seek out the information
needed to conduct appropriate tests. If raw or (preferably)
standard score summary statistics (means and standard
deviations) are available, t-tests could be done. In many cases,
unfortunately, all calculations will have been done
inappropriately (e.g., by using grade-equivalent scores) and
it will be necessary to go back to individual test scores
if meaningful analyses are to be done. If this procedure
is followed, raw or grade-equivalent scores should be
converted to their standard-score equivalents before any
arithmetic operations are performed on them. Appropriate
tests are discussed in Steps 17 and 21.

Step 23

Question Do analysis results favor the treatment group at the
preselected level of statistical significance?

Yes Review all evidence compiled during the
validation process and use judgment to
decide whether the statistical test
results can reasonably be attributed to
project effects.

No Reject evidence as being inadequate to
validate project success

Comment Given a statistically significant result, the attribution
of cause is still at issue. The final step in relating
an observed effect to the treatment requires careful
consideration of each of the extraneous effects identified
in proceeding through the decision tree and estimation of
their contribution, in aggregate, to the apparent impact
of the treatment. It is, finally, left to the judgment
of the evaluator to assess the magnitudes of these effects,
weigh their influence in the evaluation results, and
conclude whether or not the treatment was effective.

Fig. 1. Decision tree for validating statistical significance.

V. ADDITIONAL CONSIDERATIONS



The decision tree presented in the preceding section of this
report should enable reasonably unequivocal conclusions to be reached
regarding the existence or nonexistence of some treatment impact.
Difficult as that decision-making process may be, even more difficult
questions arise in assessing the practical value of the observed
impact. Relevant questions include, "What is the educational
significance of a third-of-a-standard-deviation (or any other size) gain
on a standardized reading achievement test?", "What is the significance
of a five-point gain in reading comprehension as opposed to a comparable
gain in vocabulary?", and "Is a moderate-cost treatment which produces
moderate gains more educationally significant than a costly treatment
which produces larger gains?"

Consideration of these and related questions quickly brings to
light the difficulty of making even gross-level decisions in the
absence of a metric for quantifying educational significance. And many
would argue that scores on standardized achievement tests in no way
satisfy the requirements for such a metric. Unfortunately, the lack of
a presumably adequate metric for educational significance does not
relieve decision-makers of their responsibility to choose among and
act upon the alternatives available to them. Neither does the lack
of an adequate metric imply that all measurement is infeasible or that
decisions must be made without useful guidance from educational research.
Standardized test scores do constitute meaningful indices and, if
appropriately interpreted, go a long way toward achieving their ultimate
objective.

Basic to the entire quantification issue is the sometimes overlooked 
fact that educational significance is an inherently subjective concept. 
While scales may be constructed from the consensus of experts, it must 
be acknowledged that they will be culture-bound and situation-specific. 
Furthermore, there will be educators of substantial stature who will 
disagree with any set of consensus-based priorities and relationships. 

A simple illustration can be drawn from standardized reading 
achievement tests where it is common practice to provide separate 
scales for vocabulary, comprehension, and occasionally other component 
skills. Clearly these subtests could be weighted and combined in a 
number of different ways to yield a "Total Reading" score. Some 
educators might argue that vocabulary and comprehension are equally 
important aspects of reading while others might claim that comprehension 
was twice, or five times, or even ten times as important as vocabulary. 
It is clear that the issue cannot be adequately resolved through 
empirical research and can only be dealt with by "majority rule" or some 
similar, equally unsatisfactory expedient. 

Despite the fervor with which this issue may be debated, the 
method of combining vocabulary and comprehension subtest scores to 
obtain a total reading score appears, upon closer examination, to be 
little more than a pseudo-problem. The two subtests are so highly 
intercorrelated (typically, r = .80) that even very different weighting 
systems have almost no impact on the ordering of total scores. In other 
words, students will fall into very nearly the same order whether 
comprehension scores are given ten times the weight of vocabulary scores 
or the two scales are equally weighted. Although the empirical evidence 
may be less complete, it appears that many widely debated issues in 
educational evaluation today can be deflated with the same sort of 
demonstration. Clearly, the argument that standardized achievement 
tests ought not to be used for assessing cognitive growth can be quickly 
invalidated if the correlations between test scores and other measures 
purported to reflect component skills more adequately are shown to be 
high. 
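
The weighting argument can be checked with a short calculation. The sketch below is illustrative only: it assumes both subtests are expressed in standard-score form (equal variances) and takes the r = .80 figure from the text; the function name `composite_correlation` is hypothetical.

```python
import math

def composite_correlation(weights_a, weights_b, r):
    """Correlation between two weighted totals built from the same two subtests.

    Assumes each subtest has unit variance and that the subtests
    correlate r with each other (an assumption made for illustration).
    """
    a1, a2 = weights_a
    b1, b2 = weights_b
    covariance = a1 * b1 + a2 * b2 + r * (a1 * b2 + a2 * b1)
    variance_a = a1 * a1 + a2 * a2 + 2 * r * a1 * a2
    variance_b = b1 * b1 + b2 * b2 + 2 * r * b1 * b2
    return covariance / math.sqrt(variance_a * variance_b)

# Equal weighting versus comprehension weighted ten times vocabulary.
equal_vs_tenfold = composite_correlation((1, 1), (1, 10), 0.80)
print(round(equal_vs_tenfold, 3))  # 0.965
```

Under these assumptions the two "Total Reading" scores correlate at about .96, which is consistent with the claim that students fall into very nearly the same order under either weighting.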

The conclusion, then, must be that standardized tests, with all 
their deficiencies, do provide a useful metric for assessing the basic 
skills of reading and math. Standard scores on such tests, although 
not comprising ratio scales, do provide a means of quantifying gains, 
of relating observed gains to gain expectations in a reasonable manner, 
and of measuring the impact of special instructional projects on cognitive 
growth. At the same time, it is clear that they do not provide a 
complete answer to the kinds of questions raised in the first 
paragraph of this section. The difficulty in coming to grips with these 
questions lies not in determining the size of the gains, but in 
determining their value. 

The value issue was alluded to above in discussing the relative 
value of gains in vocabulary as opposed to comprehension. In this 
situation, at least, the issue was shown to be a pseudo-problem and 
it was implied that many similar issues might be of far greater theore- 
tical than practical concern. The absolute value of achievement gains 
may also pale into relative insignificance when examined in the context 
of real-world contingencies. An achievement gain of "X" standard-score 
points is likely to be worth exactly the amount of money a school 
district is able or willing to spend to obtain it — and this, in turn, 
will depend on the needs of the children in the district and perceptions 
of the relative priorities existing among them. If needs can be ade- 
quately defined, relative comparisons among the alternatives available 
to fit them are sufficient. Absolute scales of educational significance 
may be required for the typical kind of cost-benefit studies seen in 
the harder science and engineering areas but educational issues need 
not be defined in that manner. 

In their search for effective compensatory education projects to 
package, the authors decided they would consider any treatment which 
produced one-third of a standard deviation gain with respect to the 
national norm. Above that point, choices would be based on judgments 
reflecting the size of gains, costs, replicability, availability, target 
group served, variety of approach, etc. Their original guess that the 
choices would be relatively easy to make and unequivocal was substantiated. 
While this example may be atypical, it seems that the alternatives avail- 
able to fill a specific need will rarely be so numerous as to preclude 
sound decision-making by qualified, well-informed, and thoughtful judges. 
APPENDIX A 

PROJECT SELECTION CRITERIA WORKSHEET 
SUMMARY PAGE 

PROJECT TITLE 

Date                    Initials 

DESCRIPTION 
Approach 

Pull-out - Whole class 

PREREQUISITES 
Content 

Grades 

Tryout population 
Number of tryouts 

PACKAGING CRITERIA 

I. Availability 

II. Cost 

III. Replicability 

IV. Effectiveness 

Statistical Significance 



Educational Significance 



PROJECT SELECTION CRITERIA WORKSHEET 
PRELIMINARY SCREENING CRITERIA 

AVAILABILITY 

Accessibility: 

[ ] Can be visited for validation 
[ ] Personnel are cooperative 
[ ] Procedures, results, and costs are documented 

Acceptability: 

[ ] Operational in public schools 
[ ] Not primarily a single commercial product 

COST 

[ ] Equipment plus special personnel less than $400 per pupil 
[ ] Initial investment less than $1000 per pupil 
[ ] (Alternatively) Per-pupil cost over a three year operational period 
    including start-up costs should not exceed $735 

REPLICABILITY 

[ ] Operating programs are provisionally considered replicable unless a 
    major component clearly cannot be readily duplicated. Components 
    include: materials, hardware, personnel, and environments. 

EFFECTIVENESS 

[ ] Norm and/or control group with comparable test dates 






PROJECT SELECTION CRITERIA WORKSHEET 
NOTES 



AVAILABILITY 
COST 

REPLICABILITY 
PROJECT SELECTION CRITERIA WORKSHEET 
NOTES 

EFFECTIVENESS 

Description of tryout design (s) 



PROJECT SELECTION CRITERIA WORKSHEET 
ANALYSIS OF PROJECT EVALUATION 

Complete a separate sheet for each validating site or combination of sites for which 
separate data are reported. 



PROJECT TITLE 
Tryout Group 

I. Tryout Summary 

A. Treatment group description 

1. Number 

2. Grades/Ages 

3. SES/Ethnic 

4. Pre-project achievement level 

5. Schools/Classrooms 

6. Selection procedure 

7. Treatment period dates 
Hours per week 

B. Comparison group description (if same as experimental group write "same") 

1. Number 

2. Grades/Ages 

3. SES/Ethnic 

4. Pre-project achievement level 

5. Schools/Classrooms 

6. Selection procedure 

7. Treatment period dates 
Hours per week 
PROJECT SELECTION CRITERIA WORKSHEET 
ANALYSIS OF PROJECT EVALUATION 

C. Norm-referenced (standardized) tests 

                        Pretest             Posttest 
   Name 
   Exp/Cont 
   Data reported 


D. Other measures (student, teacher, parent, other) 

Criterion-referenced tests 
Intermediate/Formative data 
Opinion/Attitude data 
Critical incidents 
Classroom grades 
Attendance/Discipline records 
Other 
PROJECT SELECTION CRITERIA WORKSHEET 
ANALYSIS OF PROJECT EVALUATION 

Evaluation of Effectiveness 

A. Factors affecting statistical significance 

1. Adequate tests 

2. Ceiling/Floor effects 

3. Pretest effect 

4. Group membership effect 

5. Student turnover 

6. Treatment/Control analysis steps 

7. Treatment/Norm analysis steps 

B. Educational Significance 

C. Other outcomes; unexpected outcomes 
APPENDIX B 

Norm-referenced versus Criterion-referenced Tests 

While use of criterion-referenced tests has been advocated for at 
least ten years (Glaser & Klaus, 1962), educational projects are still 
evaluated predominantly in terms of commercial, norm-referenced tests. 
The reluctance of educators to abandon familiar testing paradigms is 
understandable in view of the continuing confusion over the exact 
distinction between the conventional norm-referenced test and the new 
criterion-referenced instruments. This confusion is clearly evident in 
recent articles by Airasian and Madaus (1972), Jackson (1971), and 
Popham and Husek (1971), and in a review by Davis (1973) of eight 1972 
AERA papers on criterion-referenced testing. 

The confusion appears to result from conceptualizing 
criterion-referenced tests as an alternative to norm-referenced tests. In fact, 
norm- and criterion-referenced tests do not represent mutually exclusive 
test categories nor do they represent the ends of a continuum. On the 
contrary, the "norm" and "criterion" descriptors refer to completely 
independent test characteristics, both of which should probably be 
included in the description of any test. The problem is further 
complicated by the fact that, although there are real differences between 
tests that are labeled "norm-referenced" and those labeled 
"criterion-referenced," these labels do not capture the salient distinguishing 
features. 

The dominant characteristic of tests that are labeled 
"criterion-referenced" is that their content is clearly defined in terms of some 
performance dimension of interest. This relationship permits direct 
interpretation of individual scores in ways which have immediate 
practical implications (e.g., time required to run a mile, or proportion 
of the 3000 most frequent English words that the individual can define). 
The misleading label apparently derives from the failure to distinguish 
between the dimension being measured and the scale adopted to measure 
it. This failure is not surprising in the context of training program 
development which first popularized "criterion-referenced" testing. 
For example, Glaser and Klaus (1962) wrote: 

Two kinds of criterion standards are available for evaluating 
individual proficiency. First, a standard can be established 
which reflects the minimum level of performance which permits 
operation of the system. . . . At the other extreme, proficiency 
can be defined in terms of maximum system output. The 
standard of measurement is then expressed as a function of the 
capabilities of other components in the system. The man loading 
a Navy gun, for example, never needs to load more rapidly than 
he receives shells from the magazine below decks. In this case, 
a fairly absolute standard of proficiency is available. [p. 424] 

In this and similar situations, it has become popular to say that 
a performance criterion has been established and the test used in 
measuring performance need only tell us whether or not the criterion is 
reached. It might be more informative to say that the test measures a 
performance dimension (speed of loading), that system requirements 
dictate a specific cutoff score, and that in the interest of economy it 
would be adequate to dichotomize the speed of loading scale about this 
cutoff. Everyone below the cutoff would get a score of "too slow." 
Everyone above the cutoff would get a score of "fast enough." 

The term "norm-referenced" has rivaled "criterion-referenced" in 
terms of confusion generated. Any test becomes a norm-referenced test 
as soon as a norm group of one or more entities is defined and scores 
of those entities are obtained. Of course, if the norm reference is to 
be of any use there are many properties that the test and the norm group 
must have. The required properties depend entirely on the intended use 
of the test, but one typically desires relevance and proper sampling for 
norm groups, while tests should provide reliable and efficient 
quantification. 

The relative independence of norm referencing and performance 
referencing can be illustrated by an instrument used to select students 
for pilot training. Successful tests for this purpose can and have been 
developed using what are usually referred to as conventional 
norm-referenced test development procedures. It should be clear from the 
above discussion, however, that norm reference is not the salient 
characteristic of such tests. While validation groups must be used 
to develop and scale the tests, the ultimate criterion is flying 
success, and is not dependent on standings in relation to any norm 
group. Once a reliable test has been developed which correlates 
highly with a measure of pilot success, a single cutoff score, or 
criterion, could be determined, and applicants could be scored either 
pass or fail. 

At the same time, neither the procedures for developing the test 
nor the final appearance of the test would classify it as 
"criterion-referenced." That is, it is unlikely that the population of pilot skills 
would be sampled at all. Of course, one could say that the final 
instrument defined something called "pilot aptitude" but it is doubtful 
whether the concept could be identified from the test items or that 
one would feel enlightened to know that a person who scores "X" or 
more points on this aptitude could be taught to fly. An "aptitude" 
as measured by correlated items is simply not what we usually mean by 
a performance dimension. In short, this most familiar type of test is 
neither particularly "norm-referenced" nor particularly 
"criterion-referenced." 

It should be noted that the concepts discussed above are not new 
and have been recognized by various authors (e.g., Glaser & Nitko, 1971; 
Davis, 1972). Even these authors, however, preserve the 
norm/criterion-reference categories. Regardless of the terminology which is ultimately 
adopted, it must be recognized that new and useful measurement 
techniques have been introduced in the process of attempting to define and 
develop criterion-referenced tests. It should be emphasized that it is 
the categorization that is unproductive, and not necessarily the 
techniques which have been developed. 



Implications for Project Evaluation 

In contrast to the pilot-trainee selection test which was neither 
norm- nor performance-referenced, the commercial reading and math 
achievement tests used in project evaluation are both norm-referenced 
and performance-referenced. The norm group properties need little 
comment except to point out that the usual norm groups are not typical 
of disadvantaged students (see Step 9 of the decision tree) and the 
experimental groups are not tested at the same time of year as the 
norm groups (see Appendices D and E). 

The performance dimension that is defined by standardized tests is 
somewhat arbitrary, and it may well be argued that substantial 
improvement is needed here. Raw scores are seldom reported in a meaningful 
way and items are probably chosen on the basis of discrimination rather 
than as a sample of a carefully defined performance domain. The 
problems are almost certainly worse in testing reading than in testing math, 
but they reflect the basic difficulty in defining what is meant by 
reading skill and measuring it. 

While commercial standardized tests are clearly not optimal 
instruments for research purposes, there is no reason to believe that 
tests developed according to "criterion-referenced" procedures provide 
better measures of project effectiveness in basic skill areas. 
Commercial tests clearly sample important aspects of reading and math 
achievement and are relatively efficient and reliable instruments. 
They also provide normative data that permit comparisons among projects. 
However, "criterion-referenced" or other special-purpose tests may be 
used to assess project effectiveness if enough is known about their 
properties to justify estimating the significance of gains. One 
requirement, of course, is that both the statistical and educational 
significance of observed gains must be assessed against the gains which 
would be expected under non-treatment conditions. In the absence of 
normative data, the computation of expected gains clearly necessitates 
the use of a control group evaluation model. 
APPENDIX C 



Estimation of Treatment Effects from the Performance 
of an Initially Superior Comparison Group 

Throughout this procedural guide, the authors have taken the position 
that a comparison group which differs systematically from the treatment 
group on educationally relevant variables cannot provide a convincing 
estimate of how the treatment group would have performed on the posttest 
if they had not received the treatment. The only real exception is the 
case in which the treatment group starts out behind the comparison group, 
and finishes significantly ahead. There are, however, several quasi- 
experimental regression models which are applicable in certain instances 
and which may permit reasonably convincing conclusions to be drawn. Where 
the required data are available and the effort appears warranted, appli- 
cation of one of these models may be indicated. Three such models are 
discussed below. These are: 

A. The Regression-discontinuity Model 

B. The Regression Projection Model 

C. The Generalized Multiple-regression Model 

A. The Regression-discontinuity Model 

The model which appears most immune to plausible alternative 
hypotheses is the Regression-discontinuity Model (Campbell & Stanley, 1963). 
A comprehensive development of this model and related statistical tests 
is available (Sween, 1971). The model requires that treatment and 
comparison groups be developed from a single original group by assigning all 
members below a fixed pretest cutoff score to the treatment condition 
and all members above the cutoff to the comparison group.¹ Separate 
pretest-posttest regression lines are then computed for each group and 
the difference between the lines is tested at the point where they 
intersect the pretest cutoff value. 

1. Step 19 of the decision tree requires that a non-comparable control 
group be initially superior to the treatment group. This restriction 
is not strictly relevant to the Regression-discontinuity Model which 
could be applied equally well to the evaluation of special programs for 
gifted students where the comparison group was initially inferior. 
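
The core computation of the Regression-discontinuity Model can be sketched as follows. This is a minimal illustration, not the full procedure: it only estimates the gap between the two regression lines at the cutoff and omits the significance tests developed by Sween (1971). The function names, the synthetic data, and the choice to place scores exactly at the cutoff in the comparison group are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def discontinuity_at_cutoff(pre, post, cutoff):
    """Treatment-line minus comparison-line posttest estimate at the cutoff.

    Students scoring below the cutoff form the treatment group; all
    others form the comparison group (assumption for ties at the cutoff).
    """
    treat = [(x, y) for x, y in zip(pre, post) if x < cutoff]
    comp = [(x, y) for x, y in zip(pre, post) if x >= cutoff]
    b_t, k_t = fit_line(*zip(*treat))
    b_c, k_c = fit_line(*zip(*comp))
    return (b_t * cutoff + k_t) - (b_c * cutoff + k_c)

# Synthetic, noise-free data: a 5-point treatment effect below the cutoff.
pre = list(range(30, 70))
post = [0.8 * x + 10 + (5 if x < 50 else 0) for x in pre]
print(round(discontinuity_at_cutoff(pre, post, 50), 6))  # 5.0
```

With real data the two lines will not be exact, and the size of the gap would have to be judged against its sampling error before any conclusion is drawn.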

The model is rigorous in the sense that, if the procedures are fol- 
lowed correctly, rejection of the null hypothesis for any reason other 
than a treatment effect is extremely implausible. There are two con- 
siderations, however, which severely restrict the applicability of the 
model. First, it is difficult in a school environment to enforce assign- 
ment to treatment groups solely on the basis of test scores, or even on 
the basis of scores reflecting both test performance and a numerical 
teacher rating. Second, the model is not sensitive to changes in re- 
gression line slopes unless these changes are accompanied by a discon- 
tinuity of the regression lines. This requirement represents a potential 
problem since compensatory education projects are often individualized 
on the basis of student need. Such individualization could produce the 
greatest improvement in those students farthest below the pretest cutoff 
score, thereby flattening the treatment-group regression line without 
producing a discontinuity at the cutoff point. At least one compensatory 
reading project known to the authors appears to produce this kind of 
effect. 

In short, regression-discontinuity analysis is recommended for all 
cases in which the conditions for its implementation are met and a 
positive result can be anticipated. It seems unlikely, however, that such 
cases will occur frequently. 

B. The Regression Projection Model 

The Regression Projection Model uses a regression line calculated 
from the comparison-group pretest-posttest distribution to estimate what 
the treatment-group posttest scores would have been under a "no treatment" 
condition. Like the Regression-discontinuity Model, it also requires 
dichotomization of a total group into treatment and comparison subgroups 
about a particular pretest cutoff score. The advantage of this model 
is its sensitivity to treatment-produced changes in regression line 
slopes. Its primary weakness is its inability to distinguish treatment 
effects from other factors which may affect the regression line. 

The model is analogous to the technique of Karl Pearson for 
estimating total-group test validity when criterion measures are available 
only for those who score above some selected cutoff point. It is 
applicable where selection (pretest) scores are available for an entire group, 
but where there is no indication of how the subgroup below the cutoff 
score would have done on the posttest had they been treated in the same 
manner as the group above the cutoff. 

The basic assumption of the model is that under no-treatment 
conditions the regression of posttest scores on pretest scores for the total 
group would be homogeneous and linear throughout the entire score range. 
The regression line for the comparison group is taken as the estimate 
of this total group regression line, and is projected through the 
treatment-group distribution (see Figure C1). This projected regression line 
is then used to calculate the no-treatment posttest score estimate. 

The model should be applied with caution since the basic assumption 
of homogeneous, linear regression may not be tenable. For example, in 
compensatory projects, factors which lower the pretest-posttest correla- 
tion for low-scoring students may invalidate the model completely. Floor 
effects on the pretest and other factors leading to low pretest reliability 
at the lower end of the range are particularly troublesome. At a minimum, 
a good argument that such factors are not acting is required. A scatter 
diagram permitting inspection of the pretest-posttest distribution for 
irregularities is essential. 
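
The projection step described above can be sketched as follows. This is a minimal illustration under the model's own assumption of homogeneous, linear regression; the function names and the small noise-free data set are hypothetical and serve only to show the arithmetic.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def projected_no_treatment_mean(comp_pre, comp_post, treat_pre):
    """Estimated mean posttest score for the treatment group had it not
    been treated, read off the comparison-group regression line."""
    b_c, k_c = fit_line(comp_pre, comp_post)
    mean_treat_pre = sum(treat_pre) / len(treat_pre)
    return b_c * mean_treat_pre + k_c

# Hypothetical data: the comparison group lies above the pretest cutoff of 50.
comp_pre = [50, 55, 60, 65, 70]
comp_post = [50, 54, 58, 62, 66]
treat_pre = [30, 35, 40, 45]
treat_post = [44, 47, 52, 57]

estimate = projected_no_treatment_mean(comp_pre, comp_post, treat_pre)
observed = sum(treat_post) / len(treat_post)
print(estimate, observed - estimate)
```

The difference between the observed and projected means is the quantity whose significance must then be tested; the sketch does not check the linearity assumption, which still requires inspection of a scatter diagram.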

Horst (1966), Chapter 26, provides a discussion of the underlying 
statistical issues and presents formulas for generating unbiased estimates 
of the mean, standard deviation, and pretest-posttest correlation for 
the total group. The estimated regression equation for the total group 
is identical to the regression equation for the restricted (comparison) 
group. Thus, one needs only to calculate the regression equation for 
the comparison group and use it to obtain estimated treatment-group 
posttest scores. This equation can be written: 

$$\hat{Y}_t = b_c X_t + K_c$$

where $b_c$ is the slope of the comparison-group regression line and 
$K_c$ is its Y-axis intercept. 

If the mean pretest score of the treatment group ($\bar{X}_t$) is substituted 
for $X_t$ in the above equation, $\hat{Y}_t$ will be the estimated mean posttest 
score ($\bar{\hat{Y}}_t$). The difference between the actual and estimated mean 
posttest scores can then be tested using 

$$t = \sqrt{\frac{\left[P_t(\bar{Y}_t - \bar{\hat{Y}}_t)\right]^2 (N - 3)}{\bar{s}_Y^2 + b_c(b_c - 2\bar{b})\,\bar{s}_X^2 + P_t P_c(\bar{Y}_t - \bar{\hat{Y}}_t)^2}}$$

where $P_t$ = proportion of pupils in the treatment group 

$P_c$ = proportion of pupils in the comparison group 

$N$ = number of pupils in the combined group 

$\bar{s}_Y^2$ = weighted mean of the treatment- and comparison-group 
posttest variances 

$\bar{s}_X^2$ = weighted mean of the treatment- and comparison-group 
pretest variances 

$b_c$ = slope of the comparison-group regression line 

$\bar{b}$ = weighted mean of the slopes of the treatment- and 
comparison-group regression lines 

The derivation of this test is not available in the literature and is 
sketched in its entirety below. Readers not interested in this derivation 
should skip to the discussion of the Generalized Multiple-regression 
Model (Section C of this appendix). 

Significance Test for the Regression Projection Model² 

Consider first the general situation in which a regression line is 
fit to a pretest-posttest score distribution, providing an estimated 
posttest score ($\hat{Y}$) for each pretest score ($X$). The equation for the 
regression line may be written 

$$\hat{Y} = bX + K$$

where $b$ = slope of the regression line 

$K$ = Y-intercept of the regression line 

Then, for each student, we can define a value 

$$D = Y - \hat{Y}$$

which is the difference between his actual posttest score and his 
estimated posttest score or, in other words, the distance that his actual 
posttest score is above or below the regression line. 

2. We are grateful to Paul Horst for the rationale and development of 
this test. However, the authors are responsible for the presentation 
given here and for any errors it may contain. 

Next, consider the Regression Projection Model in which a regression 
line is fit to the comparison-group data and then projected through the 
treatment-group data (Figure C1). A distance $D_c$ from this regression 
line can be computed for each comparison-group student. A distance 
$D_t$ from the same comparison-group regression line can be computed for 
each treatment-group student. Because the regression line was fit to 
the comparison-group data, the mean of the comparison-group D values 
($\bar{D}_c$) will be zero. However, the mean of the treatment-group D values 
($\bar{D}_t$) will not be zero unless the mean of the treatment-group posttest 
scores falls exactly on the projected regression line, that is, unless 
$\bar{Y}_t = \bar{\hat{Y}}_t$. 

The null hypothesis which is tested in the Regression Projection 
Model includes three major conditions: (a) students are assigned to 
treatment and comparison conditions solely on the basis of their pretest 
(either single or composite) scores, (b) posttest on pretest regression 
is linear throughout the range of pretest scores, and (c) there is no 
treatment effect. If it can be assumed that the first two conditions 
are met, and if there is no treatment effect, the regression lines of 
the treatment group, the comparison group, and the total group should 
all approximately coincide. Deviations of treatment-group posttest 
scores from the projected comparison-group regression line would have an 
expected mean value of zero under these conditions and a sizable 
departure from this expectation may indicate a significant treatment 
effect. In an experimental situation, we can test whether the observed 
mean deviation ($\bar{D}$) is larger than would be expected under the conditions 
of the null hypothesis by computing 

$$t = \frac{\bar{D}}{s_{\bar{D}}}. \qquad (1)$$

Earlier in this appendix $t$ is expressed as a function of treatment- and 
comparison-group statistics. The equation is derived as follows: 

First we recall that 

$$s_{\bar{D}} = \frac{s_D}{\sqrt{df_D}}. \qquad (2)$$

Substituting (2) into (1) we may write (1) as 

$$t^2 = \frac{\bar{D}^2 (df_D)}{s_D^2}. \qquad (3)$$

We can then develop the numerator and denominator of (3) separately: 
Numerator 

The combined mean of the D values can be expressed in terms of the 
mean D values for the two groups (all D values based on the 
comparison-group regression line): 

$$\bar{D} = P_t\bar{D}_t + P_c\bar{D}_c. \qquad (4)$$

But since the regression line was fit to the comparison-group data, 

$$\bar{D}_c = 0. \qquad (5)$$

Substituting (5) into (4): 

$$\bar{D} = P_t\bar{D}_t. \qquad (6)$$

And since the mean of the D values is equal to the difference between 
the means of the posttest distribution and the estimated posttest 
distribution, we can rewrite (6) as: 

$$\bar{D} = P_t(\bar{Y}_t - \bar{\hat{Y}}_t). \qquad (7)$$
The remaining factor in the numerator of (3) is $df_D$, the number of degrees 
of freedom for the standard deviation of D. Usually $df_D$ is taken to be 
N-1 where N is the number of pairs of observations. However, two additional 
restrictions hold in this model. First, the comparison-group D values 
must sum to zero and second, the mean of the estimated posttest scores 
for the treatment group is determined by the comparison group data. 
Therefore 

$$df_D = N - 3. \qquad (8)$$

By combining (7) and (8), the numerator of (3) can finally be written 

$$\bar{D}^2(df_D) = \left[P_t(\bar{Y}_t - \bar{\hat{Y}}_t)\right]^2 (N - 3). \qquad (9)$$

Denominator 

It is well known that the variance of a difference between paired 
measures is equal to the sum of the variances of the two measures minus 
a correction for the correlation between them. In the case of D values 
from the Regression Projection Model, 

$$s_D^2 = s_Y^2 + s_{\hat{Y}}^2 - 2 r_{Y\hat{Y}}\, s_Y s_{\hat{Y}} \qquad (10)$$

where 

$r_{Y\hat{Y}}$ = the correlation between actual and estimated posttest scores 

$s_Y$ = the standard deviation of the actual posttest scores 

$s_{\hat{Y}}$ = the standard deviation of the estimated posttest scores. 

Since, by definition, 

$$\hat{Y} = b_c X + K_c \qquad (11)$$

it can be readily shown that 

$$s_{\hat{Y}} = b_c s_X \qquad (12)$$

and 

$$r_{Y\hat{Y}} = r_{XY} \qquad (13)$$

where $r_{XY}$ is the pretest-posttest correlation for the combined group. 
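Therefore, substituting (12) and (13) in (10) 

$$s_D^2 = s_Y^2 + b_c^2 s_X^2 - 2 b_c r_{XY} s_X s_Y. \qquad (14)$$

This form of the denominator could be used for computing $t$. However, 
since the treatment and comparison groups are normally analyzed separately, 
it is desirable to derive $s_D$ as a function of the separate group statistics. 
We begin by noting that the covariance between X and Y ($\sigma_{XY}$) is defined by 

$$\sigma_{XY} = r_{XY} s_X s_Y = \frac{\sum XY}{N} - \frac{\sum X}{N}\cdot\frac{\sum Y}{N}. \qquad (15)$$

But in the Regression Projection Model 

$$\frac{\sum XY}{N} = \frac{\sum X_tY_t + \sum X_cY_c}{N} \qquad (16)$$

$$\frac{\sum X}{N} = \frac{\sum X_t + \sum X_c}{N} \qquad (17)$$

$$\frac{\sum Y}{N} = \frac{\sum Y_t + \sum Y_c}{N} \qquad (18)$$

and 

$$\frac{\sum X_tY_t}{N} = P_t\,\frac{\sum X_tY_t}{N_t} \qquad (19)$$

$$\frac{\sum X_cY_c}{N} = P_c\,\frac{\sum X_cY_c}{N_c} \qquad (20)$$

where $P_t$ and $P_c$ are the proportions of treatment and comparison students, 
respectively. Similarly 

$$\frac{\sum X_t}{N} = P_t\,\frac{\sum X_t}{N_t} = P_t\bar{X}_t \qquad (21)$$

$$\frac{\sum X_c}{N} = P_c\,\frac{\sum X_c}{N_c} = P_c\bar{X}_c \qquad (22)$$

$$\frac{\sum Y_t}{N} = P_t\,\frac{\sum Y_t}{N_t} = P_t\bar{Y}_t \qquad (23)$$

$$\frac{\sum Y_c}{N} = P_c\,\frac{\sum Y_c}{N_c} = P_c\bar{Y}_c \qquad (24)$$

Substituting (19) through (24) in (16) through (18) and then the resulting 
equations in (15) we have 

$$\sigma_{XY} = \left[P_t\frac{\sum X_tY_t}{N_t} + P_c\frac{\sum X_cY_c}{N_c}\right] - \left[(P_t\bar{X}_t + P_c\bar{X}_c)(P_t\bar{Y}_t + P_c\bar{Y}_c)\right]. \qquad (25)$$

Next, we subtract the expression $(P_t\bar{X}_t\bar{Y}_t + P_c\bar{X}_c\bar{Y}_c)$ from the first brackets 
in (25) and add it to the second to get 

$$\sigma_{XY} = \left[P_t\left(\frac{\sum X_tY_t}{N_t} - \bar{X}_t\bar{Y}_t\right) + P_c\left(\frac{\sum X_cY_c}{N_c} - \bar{X}_c\bar{Y}_c\right)\right] + \left[(P_t - P_t^2)\bar{X}_t\bar{Y}_t - P_tP_c\bar{X}_t\bar{Y}_c - P_tP_c\bar{X}_c\bar{Y}_t + (P_c - P_c^2)\bar{X}_c\bar{Y}_c\right]. \qquad (26)$$

But we define 

$$\sigma_{X_tY_t} = \frac{\sum X_tY_t}{N_t} - \bar{X}_t\bar{Y}_t \qquad (27)$$

$$\sigma_{X_cY_c} = \frac{\sum X_cY_c}{N_c} - \bar{X}_c\bar{Y}_c \qquad (28)$$

Also we have 

$$(P_t - P_t^2) = P_t(1 - P_t) = P_tP_c \qquad (29)$$

and similarly 

$$(P_c - P_c^2) = P_tP_c. \qquad (30)$$

Using (27) and (28) in the first brackets of (26), and (29) and (30) in 
the second we have 

$$\sigma_{XY} = \left[P_t\sigma_{X_tY_t} + P_c\sigma_{X_cY_c}\right] + P_tP_c\left[\bar{X}_t\bar{Y}_t - \bar{X}_t\bar{Y}_c - \bar{X}_c\bar{Y}_t + \bar{X}_c\bar{Y}_c\right]. \qquad (31)$$

Let 

$$\bar{\sigma}_{XY} = P_t\sigma_{X_tY_t} + P_c\sigma_{X_cY_c} \qquad (32)$$

$$d_X = \bar{X}_t - \bar{X}_c \qquad (33)$$

$$d_Y = \bar{Y}_t - \bar{Y}_c \qquad (34)$$

Substituting (32), (33), and (34) into (31) 

$$\sigma_{XY} = \bar{\sigma}_{XY} + P_tP_c\,d_Xd_Y. \qquad (35)$$

If Y = X, we have from (35) 

$$s_X^2 = \bar{s}_X^2 + P_tP_c\,d_X^2. \qquad (36)$$

Similarly, if X = Y 

$$s_Y^2 = \bar{s}_Y^2 + P_tP_c\,d_Y^2. \qquad (37)$$

Substituting (35), (36), and (37) into (14) 

$$s_D^2 = \bar{s}_Y^2 + P_tP_c\,d_Y^2 + b_c^2\left(\bar{s}_X^2 + P_tP_c\,d_X^2\right) - 2b_c\left(\bar{\sigma}_{XY} + P_tP_c\,d_Xd_Y\right). \qquad (38)$$

Rearranging terms 

$$s_D^2 = \bar{s}_Y^2 + b_c^2\bar{s}_X^2 - 2b_c\bar{\sigma}_{XY} + P_tP_c\left(d_Y^2 - 2b_c d_Xd_Y + b_c^2 d_X^2\right) \qquad (39)$$

$$s_D^2 = \bar{s}_Y^2 + b_c^2\bar{s}_X^2 - 2b_c\bar{\sigma}_{XY} + P_tP_c\left(d_Y - b_c d_X\right)^2. \qquad (40)$$

Finally, it can be readily shown that 

$$d_Y - b_c d_X = \bar{Y}_t - \bar{\hat{Y}}_t \qquad (41)$$

and that 

$$\bar{\sigma}_{XY} = \bar{b}\,\bar{s}_X^2. \qquad (42)$$

Substituting (41) and (42) in (40) 

$$s_D^2 = \bar{s}_Y^2 + b_c(b_c - 2\bar{b})\,\bar{s}_X^2 + P_tP_c(\bar{Y}_t - \bar{\hat{Y}}_t)^2$$

which is the form of the denominator in the equation for $t$ given earlier 
in this appendix. 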
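
The equivalence between the test computed directly from the D values and the test computed from separate-group statistics can be checked numerically. The sketch below is illustrative only and rests on stated assumptions: variances and covariances are computed with N (not N - 1) in the denominator, weighted means use the group proportions as weights, and the weighted mean slope is taken as the pooled within-group covariance divided by the pooled within-group pretest variance. The function names and data are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pvar(values):
    """Variance with N in the denominator (assumption, see lead-in)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def pcov(xs, ys):
    """Covariance with N in the denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def projection_t(xt, yt, xc, yc):
    """Return t computed two ways: from D values and from group statistics."""
    n = len(xt) + len(xc)
    p_t, p_c = len(xt) / n, len(xc) / n
    b_c = pcov(xc, yc) / pvar(xc)          # comparison-group slope
    k_c = mean(yc) - b_c * mean(xc)        # comparison-group intercept

    # Route 1: deviations of every student from the comparison-group line.
    d = [y - (b_c * x + k_c) for x, y in zip(xt + xc, yt + yc)]
    t_direct = mean(d) * math.sqrt((n - 3) / pvar(d))

    # Route 2: separate-group statistics only.
    gap = mean(yt) - (b_c * mean(xt) + k_c)
    s2_y = p_t * pvar(yt) + p_c * pvar(yc)
    s2_x = p_t * pvar(xt) + p_c * pvar(xc)
    b_bar = (p_t * pcov(xt, yt) + p_c * pcov(xc, yc)) / s2_x
    s2_d = s2_y + b_c * (b_c - 2 * b_bar) * s2_x + p_t * p_c * gap ** 2
    t_groups = p_t * gap * math.sqrt((n - 3) / s2_d)
    return t_direct, t_groups

t_direct, t_groups = projection_t(
    [30, 35, 40, 45], [44, 46, 53, 57],          # treatment pre, post
    [50, 55, 60, 65, 70], [51, 53, 59, 61, 66],  # comparison pre, post
)
print(t_direct, t_groups)
```

Under these assumptions the two routes agree to within floating-point rounding, which is a useful check on any hand computation of the separate-group form.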

C. The Generalized Multiple-regression Model 

Where neither of the above models is indicated, it may be possible 
to apply a multiple regression model to the data, provided the evaluator 
can generate a useful null hypothesis. However, considerable caution 
and a thorough grasp of the technical issues involved should be considered 
prerequisites for any such effort. In particular, the widespread error 
of using regression models to statistically equate fundamentally dissimilar 
groups must be avoided. Campbell and Erlebacher (1970) have shown that, 
in terms of familiar "true score plus error score" models, conventional 
regression models systematically underadjust for the initial differences
between such groups. More basically, it should be noted that the under- 
lying "true score plus error score" construct is purely hypothetical and 
there is little evidence to suggest that it provides a useful basis for 
equating dissimilar groups. The behavior of one such group simply does 
not tell us much about the behavior of the other. 

However, in special circumstances the Generalized Multiple-regression 
Model may prove to be applicable. In the simplest case, the first step 
in applying the model is to calculate a regression equation for the
pretest-posttest distribution of the combined treatment/comparison group.
The pretest score may be considered the "predictor" variable while the
posttest score is the "criterion" variable. The variable of interest
is the "residual variance," that is, the posttest score variance which
is not predicted by the pretest regression equation. 

The second step is to add a "treatment" term as the second pre- 
dictor in the regression equation and calculate the residual variance 
about the new regression line. In the simplest case, the treatment term 
is a dichotomous variable which would be given a value of "1" for each
student in the treatment group, and "0" for each student in the comparison
group. There is, however, no reason why it could not be a continuous 
variable reflecting, for example, the hours of treatment exposure. 

The last step is to test the significance of the difference between 
the residual variance computed from the first prediction equation, and 
the residual variance predicted from the second equation. The addition 
of the treatment variable in the second equation amounts to adding a 
constant to each treatment group score. Graphically, the result is to 
generate two parallel regression lines passing through the means of the 
treatment and comparison groups, respectively. The slope of these lines is
the weighted mean of the slopes of the separate regression lines for the two
groups and will, in general, differ from the combined group regression line slope.
The significance of the effect is determined by testing the difference 
between the residual variances from the two prediction equations. 
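
The three steps above can be sketched in a few lines. The text does not name the test statistic, so the sketch assumes the usual F ratio for comparing two nested least-squares models; the data, the noise values, and the function names are hypothetical.

```python
import numpy as np

def residual_ss(predictors, y):
    """Fit y on an intercept plus the given predictors by least squares.

    Returns the residual sum of squares and the number of fitted coefficients.
    """
    design = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid), design.shape[1]

def treatment_f(pretest, posttest, treatment):
    """F ratio for the drop in residual variance when treatment is added."""
    rss_1, k_1 = residual_ss([pretest], posttest)             # step 1
    rss_2, k_2 = residual_ss([pretest, treatment], posttest)  # step 2
    df_num = k_2 - k_1
    df_den = len(posttest) - k_2
    f_ratio = ((rss_1 - rss_2) / df_num) / (rss_2 / df_den)   # step 3
    return f_ratio, df_num, df_den

# Hypothetical data: six treatment and six comparison students.
pretest = np.array([10, 12, 14, 16, 18, 20, 10, 12, 14, 16, 18, 20], dtype=float)
treatment = np.array([1] * 6 + [0] * 6, dtype=float)
noise = np.array([0.5, -0.5, 0.3, -0.3, 0.2, -0.2,
                  -0.4, 0.4, 0.1, -0.1, 0.6, -0.6])
posttest = pretest + 5.0 + 3.0 * treatment + noise

f_ratio, df_num, df_den = treatment_f(pretest, posttest, treatment)
print(f_ratio, df_num, df_den)
```

The resulting ratio would be compared with a tabled F value for the printed degrees of freedom. Replacing the 0/1 `treatment` array with hours of exposure leaves the code unchanged, which mirrors the point that the treatment term may be continuous.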

The model is a "multiple" regression model in the sense that any
number of predictors can be incorporated in the regression equation in
addition to pretest and treatment variables (e.g., teacher ratings, SES,
etc.). The model is "general" in the sense that a variety of effects can
be examined singly, additively, and interactively. For example, by
including a "treatment group" times "pretest scores" term it is possible
to test whether treatment and comparison regression line slopes are
significantly different. Finally, by including squared or other power
terms, the shape of the regression line can be tested.
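
One way to write out the family of equations described above is shown below. The notation is an assumption made for illustration and does not come from the report: $X$ is the pretest score, $Y$ the posttest score, $T$ the treatment variable, $e$ the residual, and $b_0, \ldots, b_4$ the fitted coefficients. The last line is the standard F ratio for comparing a fuller model (f, with $k_f$ coefficients) against a reduced model (r, with $k_r$ coefficients) on $N$ students.

```latex
\begin{align*}
\text{pretest only:}     \quad & Y = b_0 + b_1 X + e \\
\text{plus treatment:}   \quad & Y = b_0 + b_1 X + b_2 T + e \\
\text{plus interaction:} \quad & Y = b_0 + b_1 X + b_2 T + b_3 XT + e \\
\text{plus curvature:}   \quad & Y = b_0 + b_1 X + b_2 T + b_4 X^2 + e \\[6pt]
F = {} & \frac{(\mathrm{RSS}_r - \mathrm{RSS}_f)/(k_f - k_r)}{\mathrm{RSS}_f/(N - k_f)}
\end{align*}
```

Under this notation, a test of $b_3$ asks whether the two regression slopes differ, and a test of $b_4$ asks whether the regression line is curved.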

It will probably be recognized that the simple case described above 
is the Analysis of Covariance Model, a familiar special case of the Gen- 
eralized Multiple-regression Model. The Y-axis distance between the two 
regression lines is the adjusted posttest difference. As indicated above, 
this difference will be a biased estimate if the groups are representative 
of distinct populations. A significant effect would provide a convincing 
(negative) answer to the question "Were the two groups of posttest scores
drawn randomly from a single population?" However, such a conclusion
is trivial if it was known in advance that the groups were fundamentally
different. Similarly, it is important in all applications of regression
models to state the null hypothesis precisely, and to consider whether
its rejection will be of any interest. Where there is any confusion
concerning the assumptions of the null hypothesis or the implications
of those assumptions, regression models cannot be recommended.

APPENDIX D



Effects of Non-comparable Testing Dates on
Treatment Group versus Norm Group Comparisons

An important part of the development of commercial achievement tests
is the collection of normative data from a large sample of students. The
normative data permit the transformation of raw scores into percentile
scores, standard scores, or grade-equivalent scores which provide useful
information about the meaning of individual raw scores in relation to the
particular norm group in question. The importance of having a relevant
norm group is discussed in Section IV of this report. The importance of
having comparable test dates for experimental and norm groups is also
referenced there but is discussed here in greater detail.

There is convincing evidence that learning, as reflected in achievement
test scores, is typically not uniform over the calendar year (Beggs
& Hieronymus, 1968). It is not possible to generalize as to the nature
or causes of the non-uniformity, but one widely recognized factor that
appears to operate in certain situations is the effect of "forgetting"
over the summer months. This effect is illustrated by the hypothetical
"observed score" line in Figure D1.

The normative data for many widely used commercial tests are collected
during one short interval of the school year, typically February or
March (e.g., California Achievement Test, 1970 Ed., Comprehensive Test of
Basic Skills, 1968 Ed., Stanford Achievement Tests, 1964 Ed.). In
order to estimate appropriate scores for fall and spring, the single data
points from successive years are simply connected with a smooth curve as
illustrated by the broken line in Figure D1. It is obvious that, for
the hypothetical data in the figure, this procedure systematically
overestimates the expected fall score and underestimates the expected spring
score. It should be clear that if the estimated fall and spring scores
were used as the comparison standards for special instructional programs,
the program might appear to give unusually good results when actually
the improvement was exactly the same as that achieved by an "average"
group of students. It can be seen that observed norm-group mean scores
in October are far below the estimated scores. This means that an
experimental class scoring exactly at the norm-group fall mean would appear
to be doing very poorly when compared to the estimated fall norm-group
mean. In the spring, assuming that they continue to do exactly as well
as the norm group, the experimental class would score well above the
estimated spring score. In fact, if the estimated fall and spring scores
of Figure D1 were used to assess the progress of a typical norm-group
class during a given school year, one would get the erroneous impression
that a very poor class had been transformed into a very good class.
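
The size of the distortion can be illustrated with a small calculation. All numbers below are hypothetical and are not read from Figure D1; straight-line interpolation with `np.interp` stands in for the publisher's smooth curve, which is an assumption of the sketch.

```python
import numpy as np

# Hypothetical February norm-group means, one data point per grade.
norm_months = np.array([0.0, 12.0, 24.0])   # months since February of grade 3
norm_means = np.array([30.0, 40.0, 50.0])   # raw-score means (illustrative)

# Hypothetical means the norm group would actually have shown in grade 4,
# reflecting fast within-year growth and a loss over the summer.
observed_fall = 34.0     # October of grade 4 (month 8)
observed_spring = 44.5   # May of grade 4 (month 15)

# Fall and spring "norms" estimated by interpolating the mid-year points.
est_fall = float(np.interp(8.0, norm_months, norm_means))
est_spring = float(np.interp(15.0, norm_months, norm_means))

fall_bias = est_fall - observed_fall        # positive: fall norm too high
spring_bias = est_spring - observed_spring  # negative: spring norm too low

# A class scoring exactly at the observed norm-group means on both dates
# appears to have gained this much relative to the estimated norms.
spurious_gain = (observed_spring - est_spring) - (observed_fall - est_fall)

print(est_fall, est_spring, fall_bias, spring_bias, spurious_gain)
```

With these illustrative values the class looks below the norm in the fall and above it in the spring, although by construction it matched the norm group on both dates.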

All types of scores which are estimated by interpolation between 
data points are likely to introduce systematic errors into educational 
evaluations. These include, in general, standard scores, percentile 
scores, stanines, and grade-equivalent scores. Grade-equivalent scores 
are characterized by additional problems which are discussed in detail 
in Appendix E. Even expanded standard or scale scores may be somewhat 
distorted by curve-fitting procedures required to achieve articulation 
between levels of a test.

It must be emphasized that the data points in Figure D1 are purely
hypothetical and that different, conceivably even opposite, effects might
be found with specific tests or norm groups if the data were available.
However, in the few tests which do report normative data from two points
during the year (e.g., Gates-MacGinitie Reading Tests and Metropolitan
Achievement Tests) the effect illustrated in Figure D1 does appear to be
present (see Appendix E, Figure E5). The implication of these data is
that tests which provide normative data for only one point in the year
should not be used for norm-referenced evaluation of fall-to-spring gains,
and that, in general, it is not advisable to extrapolate or interpolate
very far from observed normative data.

APPENDIX E



Problems With Using Grade-equivalent Scores
in Evaluating Educational Gains

Evaluation reports for experimental educational projects frequently
present results in terms of grade-equivalent scores or grade-equivalent
gains. The apparent simplicity and ease of interpretation of
grade-equivalent scores have probably been responsible for their widespread
adoption. Unfortunately, however, this apparent simplicity is entirely
illusory, and there is ample evidence to contraindicate the use of
grade-equivalent scores or grade-equivalent gains for any purpose whatsoever
in educational evaluation.

The problems with grade-equivalent scores can be divided into logical
and scaling considerations. The logical considerations are well covered
in many of the teachers' guides accompanying commercial tests.
Specifically, a sixth grader who obtains a grade-equivalent score of four on a
test is not really like a median fourth grader at all. Similarly, a
second sixth grader who obtains a grade-equivalent score of eight is not
like a median eighth grader. All that can be said is that these two sixth
graders obtained the same scores as median fourth and eighth graders
reading sixth-grade material. Since their experiences, training, and
intellectual growth rates have been very different from those of students in
higher or lower grades, it is not very meaningful to make implicit
comparisons between them, particularly since these comparisons contain no
information as to where the two children stand with respect to the
achievement score distribution of their sixth-grade peers.

From a program evaluator's standpoint, the scaling problems are even
more troublesome than the logical ones. There are two primary
considerations. First, the overall relation of "reading skill" to "school grade"
is not linear as grade-equivalent scores would imply. That makes the
computation of mean grade-equivalent scores inappropriate. Second, the
relation of "reading skill" to "school grade" is not well behaved (i.e.,
not smooth) over short sections of the curve. The typically jagged
norm-group data curve is difficult to work with, so test developers
usually do some "smoothing." This smoothing introduces systematic
inaccuracies when grade-equivalent scores are used in a project
evaluation. The effects of these two kinds of problems, as well as several
others, are illustrated in the following discussion by hypothetical data,
and by actual curves from published reading comprehension scales.

The effect of the non-linear relation between reading skill and
school grade is illustrated schematically in Figures E1 and E2. Figure
E1 illustrates the commonly used format for graphically representing
student progress in terms of grade-equivalent scores. The apparent
simplicity of this format obscures important fundamental information about
the acquisition of skills such as reading which are typically learned up
to a certain level, and then maintained at that level throughout
adulthood.

The format of Figure E2 is probably more appropriate for representing 
reading achievement. No significance should be placed on the exact shape
of the curve or the values in the figure. It is simply intended to suggest
that the average student learns to read fairly well by the time he
completes junior high school and thereafter makes relatively small gains in
reading speed or comprehension (as distinguished from vocabulary). 

The reading skill of the 50th-percentile student in each grade, as
measured on an achievement test, defines the grade-equivalent scores for
the grade, so values on the reading-skill axis may be directly
interpreted as the grade-equivalent values for each level of reading skill.
It can easily be seen that, on this hypothetical curve, "half" the
sixth-grade reading skill is represented not by a third-grade score, but by
a second-grade score. Similarly, a fifth grader would be halfway
between third and ninth grade in terms of reading skill, while on a linear
scale, the halfway point would be sixth grade.

While a curvilinear relationship between grade and skill level would
be sufficient to invalidate most mathematical operations performed on
grade-equivalent scores, there is some evidence that actual learning
curves are considerably more irregular, and that curves for faster and 
slower learners are not necessarily the same shape as those for average 
learners. In general, averaging badly scaled grade-equivalent scores 
for students of different ability levels precludes any precise interpre- 
tation of group performance. 

Table E1 presents an example of what can happen when scores on a
non-equal interval scale are averaged. Two hypothetical students were
chosen to represent one standard deviation below the mean and one
standard deviation above the mean, respectively, on the Gates-MacGinitie
Reading Comprehension Scale. Normative data from grades 6.1 and 6.8
were arbitrarily selected. In this case, using the gain computed from
standard scores as the "correct" gain, the mean grade-equivalent score
overestimates the true gain by 3.5 months. While the selected example
may not be typical with respect to the magnitude of the observed effect,
its direction will hold for any negatively accelerated curve, i.e.,
the shape illustrated in Figure E2.
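
The 3.5-month figure can be reproduced directly from the grade-equivalent entries of Table E1. The sketch below copies those entries; the variable names are hypothetical, and it assumes the usual convention that one tenth of a grade-equivalent unit corresponds to one school month.

```python
# Grade-equivalent (GE) values for the two hypothetical students, from Table E1.
pre_ge = {"A": 3.95, "B": 9.60}     # pretest, grade 6.1
post_ge = {"A": 4.55, "B": 10.90}   # posttest, grade 6.8

# GE corresponding to the mean standard score (50.00) at each testing.
ge_of_mean_standard = {"pre": 6.20, "post": 6.80}

MONTHS_PER_GRADE = 10  # assumption: 0.1 GE unit = 1 school month

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Gain obtained by averaging the students' GE scores.
gain_from_mean_ge = mean(post_ge.values()) - mean(pre_ge.values())

# Gain obtained by averaging standard scores first, then converting to GE.
gain_from_standard = ge_of_mean_standard["post"] - ge_of_mean_standard["pre"]

overestimate_months = (gain_from_mean_ge - gain_from_standard) * MONTHS_PER_GRADE
print(gain_from_mean_ge, gain_from_standard, overestimate_months)
```

The two gains come out at roughly .95 and .60 grade-equivalent units, matching the bottom row of the table, and their difference is the 3.5 months cited in the text.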

The second major scaling problem results from the local
irregularities in the learning curve which are discussed in detail in Appendix D.
The primary cause of these irregularities appears to be the forgetting
that occurs over the summer vacation. This phenomenon produces the
commonly observed situation in which a class of children achieves lower
raw scores on a given test in September than they did the previous June.
As illustrated in Figure E3, for example, a single raw score could be
the median score for both grades 4.8 and 5.4. While logically both
grade-equivalent scores should be assigned to this raw score, this
practice is considered overly confusing, or unesthetic, and is not widely
adopted in commercial tests. Instead, some "smoothing" of the data
points is done as represented by the solid line in the figure.

TABLE E1

Mean Scores for Two Hypothetical Students, October and May
Gates-MacGinitie Reading Comprehension Survey D

                             Raw Score   Standard Score   Grade Equivalent

Pretest - Grade 6.1
  Student A (16 %ile)          22.50         40.00              3.95
  Student B (84 %ile)          46.50         60.00              9.60
  Mean                         34.50         50.00              6.78
  Grade-equivalent              5.40          6.20              6.78

Posttest - Grade 6.8
  Student A (16 %ile)          27.50         40.00              4.55
  Student B (84 %ile)          48.00         60.00             10.90
  Mean                         37.75         50.00              7.73
  Grade-equivalent              5.95          6.80              7.73

Grade-equivalent Gain            .55           .60               .95

The smooth line is used to assign grade-equivalent values to raw
scores. This procedure results in a single grade-equivalent value for
each raw score but systematically exaggerates the apparent learning gains
in experimental situations which use fall and spring testing. For example,
as shown in the figure, a fifth-grade student who scores "6" in November
and "12" in May is exactly at the median of his class, but his
grade-equivalent scores would indicate that he had progressed from
grade 4.6 to grade 6.4, an eighteen-month gain in six months.
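
The mechanism can be sketched with a jagged set of hypothetical norm-group medians. The values below are not taken from Figure E3, and a least-squares straight line from `np.polyfit` stands in for whatever smoothing a test publisher actually applies; both are assumptions of the sketch, so the output is not expected to match the 4.6 and 6.4 quoted above.

```python
import numpy as np

# Hypothetical median raw scores: rapid gains within each school year
# (x.2 to x.8) followed by a drop over the summer (x.8 to the next x.2).
grades = np.array([4.2, 4.8, 5.2, 5.8, 6.2, 6.8])
median_raw = np.array([3.0, 9.0, 6.0, 12.0, 9.0, 15.0])

# "Smoothing": fit one straight line through the jagged medians.
slope, intercept = np.polyfit(grades, median_raw, 1)

def smoothed_grade_equivalent(raw_score):
    """Grade equivalent read off the smoothed line for a given raw score."""
    return (raw_score - intercept) / slope

# A student exactly at the class median in the fall (5.2) and spring (5.8).
fall_ge = smoothed_grade_equivalent(6.0)
spring_ge = smoothed_grade_equivalent(12.0)

apparent_gain = spring_ge - fall_ge   # in grade-equivalent units
actual_elapsed = 5.8 - 5.2            # school time that really passed

print(fall_ge, spring_ge, apparent_gain, actual_elapsed)
```

Because the smoothed line is flatter than the within-year growth it replaces, a median student is assigned a fall grade equivalent below his actual grade placement and a spring grade equivalent above it, so the apparent gain exceeds the elapsed time.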

The result of using "smoothed" grade-equivalent scores is illustrated
graphically in Figure E4. In this figure, the broken line represents
the "national norm" in its commonly (mis)conceived linear,
month-for-month growth-rate form. The points connected by the solid line are
grade-equivalent scores achieved by the median child at each grade level
as derived from the smoothed Figure E3 curve. The jagged curve reappears
in Figure E4, but in this context it is inherently confusing because
implicit in the concept of grade-equivalent scores is the notion that the
median student's scores "should" fall along the dotted line "national
norm." Clearly, they do not, but it is difficult to explain to the
uninitiated why the median "grade-equivalent score" for students at grade
4.8 is 5.4, and the grade-equivalent score corresponding to grade 5.2 is
4.6. If a grade-equivalent score is not, in fact, the score of the
median student at that grade level, then the interpretation of the score
becomes so difficult as to preclude its usefulness. It appears that,
in some evaluations, this confusion has led educators to be unduly
impressed by very ordinary achievement gains.

It should be noted that this scaling problem is different from the
problem of non-comparable test times for norm and experimental groups
discussed in Appendix D. Appendix D points out the problems in
extrapolating mid-year norm data to fall and spring test dates. The current
problem applies to tests which obtain fall and spring norm data but do
not accept the data at face value. In both cases the procedure is to
artificially smooth an irregular curve, and the effect on project
evaluations is to spuriously inflate the apparent amount of learning. It is
generally impossible to estimate from information presented in test
manuals how much these factors influence test scores or even whether
there is any effect at all in specific instances. However, while evidence
on the exact magnitude of the effects is sparse, it seems clear that the
effects are relatively pervasive.

An additional problem that complicates interpretation of
grade-equivalent scores is the restricted range of the typical achievement
test. In general, a single test is developed for use in three or
fewer grades. Most test companies develop a series of tests of
increasing difficulty to cover the entire range of primary (and sometimes
secondary) grades. The result is that students scoring more than a 
year or two "below grade level" may be out of the norm range that was 
used to develop the test. For example, a test designed for seventh 
through ninth grade is usually normed on seventh, eighth, and ninth 
graders. Data may also be collected from sixth and tenth graders. 
However, the manual may report grade-equivalent scores as low as second 
or third grade. Obviously, these are simply projected scores since no 
second or third graders were ever included in the norm group for the 
test. The error in estimating what median third graders would have 
scored if they had taken the test is thus added to the problem of in- 
terpreting an unequal-interval scale. 

Actual data illustrating the above effects are given in Figures E5
and E6. Figure E5 displays grade-equivalent scores for the 16th
percentile students (approximately the mean of the bottom quartile). The
scores were taken from the manual of a widely used reading test. They
were derived from normative data collected by the test developers and
reflect the same type of data as the hypothetical smoothed curve in 
Figure E3 except that the vertical axis is scaled in grade equivalents 
rather than raw scores. The data have been smoothed, according to the 
accompanying technical manual, but the extent of the smoothing is not 
reported. 

It will be noted that within-year gains are, in general, closer to 
month-for-month than are between-year gains. We cannot tell from the 
reported information to what extent (if any) the smoothing has reduced 
this effect. It is important to keep in mind, however, that the only 
reason the effect is observable at all is that the test in question
includes normative data from two points in the school year: October and
April. The reported norms which are plotted in Figure E5 (October,
February, and May) are extrapolated from the two actual test dates.
Many widely used tests preclude the detection of any discrepancy
between within-year and between-year learning rates by collecting
normative data at only one time during the year and extrapolating to provide
intermediate "norm data" (see Appendix D).

It will also be noted that the appearance of the curve changes after
grade six. It is not clear what produces the change but it seems likely
that one factor is the relatively high drop-out rate of low scoring
students in junior and senior high school. The sixteenth-percentile
student in the high-school norm group probably stood relatively much higher
in his first-grade peer group distribution simply because first-grade
distributions include a large number of slow students who drop out before 
reaching high school. 

Figure E6 presents data from a study by Tallmadge (1973) of all
California Title I students. This curve is analogous to the schematic
curve illustrated in Figure E4. It is based on a variety of tests and
includes the effects of both smoothing and non-comparable norm times.
These effects are undoubtedly confounded with those of other extraneous 
variables, as Tallmadge points out: 

There is some danger in interpreting Figures 1 and 2 as if
they represented longitudinal data. They do not; the data are
cross sectional and each year's growth is represented by a
different sample of pupils. For this reason it is not strictly
legitimate to talk about losses over the summer. We do not
know how those children represented by each pretest point on the
figures scored at posttest time the year before. Still, it
seems reasonable to assume that many, and perhaps most, of the
children served by Title I in the sixth grade this year were
also served last year in the fifth grade and in earlier grades
and years as well. Until data are acquired over at least a
12-month interval (ideally from posttest one year to posttest
the following year), questions of this sort must remain
unanswered.

Hopefully, it is clear from the above discussion that the apparent
simplicity of grade-equivalent scores obscures their basically complex
nature. While they may serve some purpose in individual counseling and
guidance, the purpose for which the achievement tests were designed,
the current widespread use of grade-equivalent scores in evaluating
educational programs can only be considered extremely unfortunate.